Abstract

It has been shown that AIC-type criteria are asymptotically efficient selectors of the
tuning parameter in non-concave penalized regression methods under the assumption
that the population variance is known or that a consistent estimator is available. We
relax this assumption to prove that AIC itself is asymptotically efficient and we study
its performance in finite samples. In classical regression, it is known that AIC tends to
select overly complex models when the dimension of the maximum candidate model is
large relative to the sample size. Simulation studies suggest that AIC suffers from the
same shortcomings when used in penalized regression. We therefore propose the use of
the classical corrected AIC (AICc) as an alternative and prove that it maintains the
desired asymptotic properties. To broaden our results, we further prove the efficiency of
AIC for penalized likelihood methods in the context of generalized linear models with
no dispersion parameter. Similar results exist in the literature but only for a restricted
set of candidate models. By employing results from the classical literature on
maximum-likelihood estimation in misspecified models, we are able to establish this result
for a general set of candidate models. We use simulations to assess the performance of AIC
and AICc, as well as that of other selectors, in finite samples for both SCAD-penalized
and Lasso regressions, and we consider a real data example.

KEY WORDS: Akaike information criterion; Least absolute shrinkage and selection
operator (Lasso); Model selection/variable selection; Penalized likelihood; Smoothly
clipped absolute deviation (SCAD).



1 Introduction 

Regularized (or penalized) likelihood methods have become widely used in recent years due 
to the increased availability of large data sets. These methods operate by maximizing the 
penalized likelihood function 

$$ \frac{1}{n}\, l(\beta) - \sum_{j=1}^{d_n} p_\lambda(|\beta_j|) \tag{1.1} $$

with respect to $\beta \in \mathbb{R}^{d_n}$, where $l(\beta)$ is the working log-likelihood function, $d_n$ is the total
number of predictors, and $p_\lambda(\cdot)$ is a penalty function that penalizes against model complexity
and the size of the estimated coefficients. The working log-likelihood is used to justify the
first part of the function (e.g., in Least Squares, the working log-likelihood is based on the
Gaussian distribution). As demonstrated in Sections 2 and 3, many of the results discussed
in this paper are valid even if the working log-likelihood is misspecified. With these methods,
increasing the amount of regularization increases the number of estimated coefficients that are
set equal to zero, thus performing "automatic" variable selection through the data-dependent
choice of the regularization parameter, $\lambda$. In contrast, variable selection in classical regression
is commonly done using the Leaps and Bounds algorithm (Furnival and Wilson, 1974), which
becomes infeasible when the number of predictors is much larger than 30 (Hastie et al., 2009).
For most penalty functions, efficient algorithms exist to compute the estimated models over
a regularization path, making it possible to do variable selection in high dimensions.

The performance of the estimated model heavily depends on the choice of the regularization
parameter. In regularized regression several classical model selection procedures have
been heuristically applied as selectors of this parameter, including information criteria such
as Akaike's information criterion (AIC; Akaike, 1973), the Bayesian information criterion
(BIC; Schwarz, 1978), and Generalized cross-validation (GCV; Craven and Wahba, 1978),
as well as data-based selection procedures such as $k$-fold cross-validation (see, e.g., Fan and
Li, 2001; Zou et al., 2007; Wang et al., 2007; and Zhang et al., 2010 for applications of these
selectors to penalized regression estimators). The statistical properties of these model selection
procedures have been widely studied in the context of classical regression and an ongoing
research problem is to determine if these properties carry over to the context of penalized
regression.

The asymptotic performance of model selection procedures can be studied under two important
and distinct settings: (1) when the true model is not among the candidate models
(the "non-true model world") and (2) when the true model is among the candidate models
(the "true model world"). In the non-true model world a reasonable goal is efficient model
selection, meaning that we would like to select the model that asymptotically performs the
best amongst the candidate models. In contrast, in the true-model world most of the literature
focuses on consistent model selection, meaning that the probability that the true
model is chosen is asymptotically one. In general, a model selection procedure cannot be
both consistent and efficient (Shao, 1997; Yang, 2005). Although the non-true model world
has been extensively studied in classical regression (e.g., Shibata, 1981; Li, 1987; Hurvich
and Tsai, 1989, 1991; Shao, 1997; and Burnham and Anderson, 2002), the majority of the
research on model selection in penalized regression has focused on the true model world (e.g.,
Leng et al., 2006; Zou et al., 2007). The non-true model
world is more realistic in many situations since the data-generating process is likely to be too
complex to know exactly; this is the essence of George Box's famous admonition that "all
models are wrong, but some are useful" (Box, 1979). This setting should be of particular
interest to researchers and data analysts in areas such as social science and environmental
health, where a large number of predictors are expected to influence the dependent variable
(too many to include in model fitting; Gelman, 2010), as well as machine learning, where the
goal is typically not to uncover the true data-generating process but rather to find a model
that can predict well.

In the context of generalized linear models (GLMs), Zhang et al. (2010) (hereafter ZLT)
proposed the use of a "GIC-type" criterion,



$$ GIC_{\kappa_n}(\lambda) = -\frac{2}{n}\, l(\hat{\beta}_\lambda) + \kappa_n \frac{df_\lambda}{n}, $$



for choosing the regularization parameter $\lambda$ for non-concave penalized estimators in both the
non-true model world and the true-model world. Here $\hat{\beta}_\lambda$ is the estimator that maximizes
(1.1) for a specific $\lambda$, $df_\lambda$ is the effective degrees of freedom and the log-likelihood function
corresponds to a member of the exponential family, i.e.

$$ l(\beta) = \sum_{i=1}^{n} \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right], $$

where the form of functions $a(\cdot)$, $b(\cdot)$, and $c(\cdot, \cdot)$ depends on the specified distribution and
$\phi$ is the dispersion parameter (see e.g. McCullagh and Nelder, 1989). They showed that
"AIC-type" versions of $GIC_{\kappa_n}$ ($\kappa_n \to 2$) are efficient in the former case, while "BIC-type"
versions of $GIC_{\kappa_n}$ ($\kappa_n \to \infty$ and $\kappa_n/\sqrt{n} \to 0$) are consistent in the latter case.

In the Gaussian model, $GIC_{\kappa_n}$ takes on a form that includes the true error variance $\sigma^2$,
and the proofs operate under the assumption that this is known or that a consistent estimator
is available. However, if the true model is not included in the set of candidate models then
a consistent estimator of the true error variance may not be available (Shao, 1997), making
the efficiency proofs of ZLT not applicable in practice. This motivates us to extend the ZLT
results in various ways. First, we show that the feasible version of $GIC_2$, which corresponds
to the well-known $C_p$ measure (Mallows, 1973), is in fact efficient in the non-true model
world. Second, we show that AIC and GCV, which do not require a consistent estimator
of $\sigma^2$, are also efficient. Third, we show that although several model selection procedures
may be asymptotically optimal, performance varies in finite samples. Specifically, we study
performance when the number of predictors is allowed to be large relative to the sample size
and show that AIC, BIC, $C_p$, and GCV can all overfit catastrophically
(selecting $\lambda$ values approaching 0). In classical regression Hurvich and Tsai (1989)
showed that AIC has a tendency to select overly complex models when the dimension of
the maximum candidate model is large relative to the sample size and proposed a corrected
version of AIC (AICc). We show that AICc is also efficient, but avoids the tendency to
select overly complex models. We use Monte Carlo simulations to illustrate the properties of
these methods in finite samples and compare their performance against the data-dependent
method 10-fold CV.

For GLMs where there is no dispersion parameter (e.g., probit and logistic regression or
the Poisson log-linear model), there is no difference between $GIC_2$ and AIC. However, in
their proof ZLT restrict the set of candidate models to ones where the estimated parameter
converges in probability to the true parameter uniformly. To weaken this assumption we employ
the result from White (1982) that the maximum-likelihood estimator converges almost
surely to a "pseudo-true" parameter (the parameter that minimizes the Kullback-Leibler
(KL) loss function) when the model is misspecified and prove the efficiency of AIC under a
weaker set of assumptions. These results, and the results for the Gaussian model, apply to a
wide range of penalized likelihood estimators, including both non-concave penalized estimators
and the well-known Least absolute shrinkage and selection operator (Lasso) estimator
(Tibshirani, 1996).

The remainder of the paper is organized as follows. Section 2 focuses on penalized regression
and establishes the efficiency results for $C_p$, AIC, GCV and AICc without the
assumption that the true population variance is known or that a consistent estimator exists.
Section 3 focuses on GLMs where there is no dispersion parameter and establishes the efficiency
of AIC for a general set of candidate models. Section 4 presents simulation results
that explore the finite-sample behavior of the different selectors when the number of predictors
is allowed to be large relative to the sample size. An empirical example that highlights
the varying performance of the selectors is presented in Section 5. Concluding remarks are
given in Section 6. The main proofs are included in the appendix with some auxiliary results
included in the supplementary material.

2 Gaussian Model 

For ease of notation, in this section, and for the remainder of the paper, we suppress the
subscript $n$ where we feel it is clear that a variable depends on the sample size.
To study model selection in regularized regression we consider the model

$$ y = \mu + \varepsilon, $$

where $y = (y_1, \ldots, y_n)^T$ is the $n \times 1$ response vector, $\mu = (\mu_1, \ldots, \mu_n)^T$ is an $n \times 1$ unknown
mean vector and the entries of the $n \times 1$ error vector $\varepsilon$ are independent and identically
distributed (iid) with mean 0 and variance $\sigma^2$. The mean vector is estimated by $\hat{\mu}_\lambda = X\hat{\beta}_\lambda$,
where $X = (x_1, \ldots, x_n)^T$ is an $n \times d_n$ deterministic matrix of predictors and $\hat{\beta}_\lambda$ is the estimator
that minimizes the penalized least squares function

$$ \frac{1}{2n} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \sum_{j=1}^{d_n} p_\lambda(|\beta_j|) $$

with respect to $\beta \in \mathbb{R}^{d_n}$.

Adopting the notation from ZLT, we let the index set $A_n$ denote the class of all candidate
models and we assume that $\bar{\alpha} = \{1, \ldots, d_n\}$ is the largest model in $A_n$. For any $\alpha \in A_n$,
we define $d_\alpha$ to be the number of predictor variables included in the candidate model. We
further define the least squares estimated mean vector by $\hat{\mu}_\alpha = X_\alpha \hat{\beta}_\alpha$, where $X_\alpha$ is the
matrix of predictors that are included in candidate model $\alpha$ and $\hat{\beta}_\alpha$ is the corresponding
vector of the estimated least squares coefficients. The associated projection matrix is
$H_\alpha = X_\alpha (X_\alpha' X_\alpha)^{-1} X_\alpha'$. For a given $\lambda$, we define $\alpha_\lambda$ to be the model $\alpha \in A_n$ whose predictors
are those with non-zero coefficients in the penalized estimator $\hat{\beta}_\lambda$ and let $df_\lambda$ denote the
effective degrees of freedom. The least squares estimated mean vector based on the model
$\alpha_\lambda$ is denoted by $\hat{\mu}_{\alpha_\lambda} = X_{\alpha_\lambda} \hat{\beta}_{\alpha_\lambda}$. In this equation, $X_{\alpha_\lambda}$ is the matrix of predictors whose
coefficients are not shrunk to zero in the penalized estimator $\hat{\beta}_\lambda$ and $\hat{\beta}_{\alpha_\lambda}$ are the estimated
coefficients from the least squares model fit using these predictors. The associated projection
matrix in this case is defined as $H_{\alpha_\lambda} = X_{\alpha_\lambda} (X_{\alpha_\lambda}' X_{\alpha_\lambda})^{-1} X_{\alpha_\lambda}'$.

If we assume that we are in the non-true model world, then a reasonable goal is efficient
model selection. The $L_2$ loss is commonly used to assess the predictive performance of an
estimator and is calculated as

$$ L(\hat{\beta}_\lambda) = \frac{\|\mu - \hat{\mu}_\lambda\|^2}{n}. $$

If we let $\hat{\lambda}_n$ denote the regularization parameter selected by a given selection procedure, then
the procedure is defined to be asymptotically loss efficient if

$$ \frac{L(\hat{\beta}_{\hat{\lambda}_n})}{\inf_{\lambda \in [0, \lambda_{\max}]} L(\hat{\beta}_\lambda)} \to_p 1, $$

and $\hat{\beta}_{\hat{\lambda}_n}$ is said to be an asymptotically loss efficient estimator.

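For the efficiency proofs we further require the following notation. In classical regression
the risk function is defined as

$$ R(\hat{\beta}_\alpha) = E_0\left( L(\hat{\beta}_\alpha) \right) = \Delta_\alpha + \frac{\sigma^2 d_\alpha}{n}, $$

where $E_0$ denotes expectation under the true model and $\Delta_\alpha = \|\mu - H_\alpha \mu\|^2 / n$. Letting $d_{\alpha_\lambda}$
denote the number of predictors with non-zero coefficients in the penalized estimator we
further define the function

$$ R(\hat{\beta}_{\alpha_\lambda}) = \Delta_{\alpha_\lambda} + \frac{\sigma^2 d_{\alpha_\lambda}}{n}, $$

which is a random variable.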
2.1 Model Selection Procedures

$k$-fold CV is commonly used to select tuning parameters in both the statistical and machine
learning literature. It operates by first randomly dividing the data set into $k$ roughly equally
sized subsets; then, for each subset, the prediction error is computed based on the model
fit using the data excluding that subset. The tuning parameter that minimizes the average
squared error computed across the subsets is then selected. In classical regression it has been
shown that $k$-fold CV should have the same asymptotic properties as $GIC_{\kappa_n}$ with

$$ \kappa_n = \frac{2k - 1}{k - 1} $$

(Shao, 1997). Applying this result, 10-fold CV should have the same asymptotic performance
as $GIC_{\kappa_n}$ with $\kappa_n = 2.11$, suggesting that 10-fold CV should be efficient. Under the assumption
of an orthonormal design matrix, Leng et al. (2006) showed that if the Lasso-estimated
model minimizes the prediction error then it will fail to select the true model with non-zero
probability. The authors noted that this suggests that $k$-fold CV is inconsistent, but to our
knowledge, the asymptotic properties of $k$-fold CV have not been fully established in the
context of penalized regression. While a rigorous extension of the classical theory for $k$-fold
CV to penalized regression is beyond the scope of this paper, the simulation results suggest
that $k$-fold CV is efficient in the current context.
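As a small numerical illustration of the correspondence cited from Shao (1997), the sketch below evaluates $\kappa_n = (2k-1)/(k-1)$ for a few fold counts. The function name is ours and purely illustrative; it is not part of any package used in the paper.

```python
def gic_kappa_for_kfold(k):
    """Return kappa_n = (2k - 1) / (k - 1), the GIC penalty multiplier
    that k-fold CV is asymptotically matched to in classical regression."""
    if k < 2:
        raise ValueError("k-fold CV needs at least k = 2 folds")
    return (2 * k - 1) / (k - 1)


# kappa_n decreases toward 2 (the AIC-type value) as the number of folds grows.
for k in (2, 5, 10, 100):
    print(k, round(gic_kappa_for_kfold(k), 3))
```

For $k = 10$ the multiplier is $19/9 \approx 2.11$, which is close to the AIC-type value of 2.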

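In addition to 10-fold CV, we study the performance of several information criteria.
Specifically, we consider

$$ AIC_\lambda = \log(\hat{\sigma}^2_\lambda) + 2\,\frac{df_\lambda}{n}, $$

$$ AICc_\lambda = \log(\hat{\sigma}^2_\lambda) + 2\,\frac{df_\lambda + 1}{n - df_\lambda - 2}, $$

$$ BIC_\lambda = \log(\hat{\sigma}^2_\lambda) + \log(n)\,\frac{df_\lambda}{n}, $$

$$ GCV_\lambda = \frac{\hat{\sigma}^2_\lambda}{(1 - df_\lambda / n)^2}, $$

and

$$ C_{p_\lambda} = \hat{\sigma}^2_\lambda + 2\,\tilde{\sigma}^2\,\frac{df_\lambda}{n}. $$

In the above we define

$$ \hat{\sigma}^2_\lambda = \frac{\|y - \hat{\mu}_\lambda\|^2}{n} $$

and

$$ \tilde{\sigma}^2 = \frac{\|y - \hat{\mu}_{\bar{\alpha}}\|^2}{n - d_n - 1}. $$
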

With the exception of 10-fold CV, all of the above model selection procedures require a
definition of the effective degrees of freedom for the penalized regression method. In what
follows, we use a heuristic definition and define the effective degrees of freedom to be the
number of non-zero coefficients in $\hat{\beta}_\lambda$ and denote this by $d_{\alpha_\lambda}$. Zou et al. (2007) proved that
the number of non-zero coefficients is an unbiased estimator of the degrees of freedom for the
Lasso. For SCAD, Fan and Li (2001) proposed setting the degrees of freedom equal to the
trace of the approximate linear projection matrix. Based on Proposition 1 from ZLT, our
efficiency proofs would still hold if this alternate definition is used.
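The selectors in this section can be evaluated from three quantities at each point on the regularization path: the residual sum of squares, the number of non-zero coefficients, and the sample size. The sketch below is a minimal illustration under stated assumptions: it uses the standard forms of AIC, AICc, BIC, GCV and $C_p$ with $df_\lambda$ taken to be the number of non-zero coefficients, the function names are ours, and the toy path is invented purely to show the mechanics.

```python
import math


def selection_criteria(rss, df, n, sigma2_full):
    """Evaluate the criteria at one value of lambda.

    rss         : residual sum of squares ||y - mu_hat_lambda||^2
    df          : number of non-zero coefficients (heuristic degrees of freedom)
    n           : sample size
    sigma2_full : error variance estimate from the full least squares fit (for Cp)
    """
    if not 0 <= df < n - 2:
        raise ValueError("need 0 <= df < n - 2 so that the AICc penalty is defined")
    sigma2_hat = rss / n
    return {
        "AIC": math.log(sigma2_hat) + 2 * df / n,
        "AICc": math.log(sigma2_hat) + 2 * (df + 1) / (n - df - 2),
        "BIC": math.log(sigma2_hat) + math.log(n) * df / n,
        "GCV": sigma2_hat / (1 - df / n) ** 2,
        "Cp": sigma2_hat + 2 * sigma2_full * df / n,
    }


def select_index(path, n, sigma2_full, criterion):
    """Return the index on the path that minimizes the named criterion."""
    scores = [selection_criteria(rss, df, n, sigma2_full)[criterion] for rss, df in path]
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical path of (rss, df) pairs; the last entry is a near-saturated fit.
n = 50
path = [(120.0, 0), (70.0, 2), (55.0, 5), (45.0, 20), (5.0, 40)]
sigma2_full = 5.0 / (n - 40 - 1)  # variance estimate from the largest fit

for name in ("AIC", "AICc", "BIC", "GCV", "Cp"):
    print(name, select_index(path, n, sigma2_full, name))
```

In this made-up path the AICc penalty $2(df_\lambda + 1)/(n - df_\lambda - 2)$ grows sharply as $df_\lambda$ approaches $n$, which is the mechanism by which AICc guards against the near-saturated fit.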



2.2 Efficiency Results 

We show here that, assuming that the true model is not in the set of candidate models,
$C_{p_\lambda}$, $AIC_\lambda$, $GCV_\lambda$, and $AICc_\lambda$ are efficient selectors of the regularization parameter. The
dimension of the full model, $d_n$, is allowed to tend to infinity with $n$ but it is assumed that
$d_n/n \to 0$. The efficiency proofs operate under the same assumptions as those of ZLT, which
are presented here for completeness:

(A1) $(\frac{1}{n} X'X)^{-1}$ exists and its largest eigenvalue is bounded by a constant number $C$.
(A2) $E\varepsilon_1^{4q} < \infty$, for some positive integer $q$.






(A3) The risks of the least squares estimators $\hat{\beta}_\alpha$ satisfy

$$ \sum_{\alpha \in A_n} \left( n R(\hat{\beta}_\alpha) \right)^{-q} \to 0. $$

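(A4)

$$ \sup_{\lambda \in [0, \lambda_{\max}]} \frac{\|b\|^2}{R(\hat{\beta}_{\alpha_\lambda})} \to 0, $$

where $b$ is a $d_n \times 1$ vector where $b_i = p'_\lambda(|\hat{\beta}_{\lambda i}|)\,\mathrm{sgn}(\hat{\beta}_{\lambda i})$ for all $i$ such that $|\hat{\beta}_{\lambda i}| > 0$ and
is equal to 0 otherwise.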

The first three assumptions are common in the literature on model selection. Assumption 
(A1) requires the matrix of predictors to have full column rank and (A2) implies that efficiency
can still apply even when penalized least squares is used but the true distribution of the error 
terms is not Gaussian. Assumption (A3) puts a restriction on how close the candidate models 
can be to the true model and precludes any scenario where the true model is included in the 
set of candidate models. The last assumption, (A4), is the only assumption that involves the 
penalty function and ZLT provided the following three sufficient conditions for the assumption 
to be satisfied. 

(S1) $\sqrt{n}\,\lambda_{\max} < M_1$ for all $n$ for some constant $M_1 > 0$.

(S2) For any $\theta$, $p'_\lambda(\theta) < M_2 \lambda$ for some constant $M_2 > 0$.

(S3) $n \|\mu - H_{\bar{\alpha}} \mu\|^2 / d_n \to \infty$ as $n \to \infty$.

As pointed out by an anonymous referee, assumption (A3) restricts the size of the set of
candidate models. The classical literature on model selection primarily worked with nested
subsets and did not require the consideration of all subsets (e.g., Shibata (1981), Shao (1997),
and Li (1987)); however, since the subsets selected by methods such as the Lasso or SCAD are
data dependent, the set of candidate models is random and we cannot rule out any particular
candidate model a priori. Therefore we need $A_n$ to include all $2^{d_n}$ subsets in order to use
the theory from classical model selection. Alternatively, if the data analyst can assume that
the error terms are normally distributed then assumption (A3) can be replaced by a weaker
assumption from Shibata (1981).

(A3*) For any $0 < \delta < 1$, $\sum_{\alpha \in A_n} \delta^{\,n R(\hat{\beta}_\alpha)} \to 0$.

The following lemma details the restrictions on the behavior of $d_n$.

Lemma 2.1. Assume that for all $n$ sufficiently large

$$ \|\mu - H_{\bar{\alpha}} \mu\|^2 > k_1 n d_n^{k_2} \tag{2.1} $$

for some positive constant $k_1$ and some constant $k_2 < 0$. Then (A3) will hold if

$$ \lim_{n \to \infty} \frac{d_n}{\log_2(n)} < q, \tag{2.2} $$

and (A3*) will hold if

$$ \lim_{n \to \infty} n d_n^{k_2 - 1} = \infty. \tag{2.3} $$

The proof is presented in the appendix. This lemma shows that under (A3) $d_n$ can at most
grow logarithmically with $n$; however, polynomial growth rates are allowed under assumption
(A3*) so long as $d_n = n^c$ for $c < \frac{1}{1 - k_2}$. Specific values of $k_2$ are worked out for the simulation
examples considered in Section 4.1.
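To see where the bound on $c$ comes from, one can substitute $d_n = n^c$ into the growth condition for (A3*). The short derivation below is a sketch that takes that condition to be $\lim_{n \to \infty} n d_n^{k_2 - 1} = \infty$, which is our reading of (2.3); the value $k_2 = -2$ is the one implied by the bound $c < 1/3$ quoted for the trigonometric example in Section 4.1.

```latex
n\, d_n^{\,k_2 - 1} \;=\; n \cdot n^{c (k_2 - 1)} \;=\; n^{\,1 - c (1 - k_2)} \;\to\; \infty
\quad\Longleftrightarrow\quad 1 - c (1 - k_2) > 0
\quad\Longleftrightarrow\quad c < \frac{1}{1 - k_2}.
% Example: k_2 = -2 gives c < 1/3.
```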

The asymptotic efficiency of $C_{p_\lambda}$ is given by the following result.

Theorem 2.1. Assuming (A1)-(A4) hold and that $d_n/n \to 0$ as $n \to \infty$, the regularization
parameter, $\hat{\lambda}_n$, selected by minimizing $C_{p_\lambda}$ yields an asymptotically loss efficient estimator,
$\hat{\beta}_{\hat{\lambda}_n}$.

To further establish the efficiency of $AIC_\lambda$, $GCV_\lambda$ and $AICc_\lambda$ we require the following
two theorems. The first proves the efficiency of $GIC_2$ with the true error variance replaced
by the estimated error variance based on the candidate model.
Theorem 2.2. Assuming (A1)-(A4) hold and that $d_n/n \to 0$ as $n \to \infty$, the regularization
parameter, $\hat{\lambda}_n$, selected by minimizing

$$ \Gamma_n(\lambda) = \hat{\sigma}^2_\lambda \left( 1 + \frac{2 d_{\alpha_\lambda}}{n} \right) $$

yields an asymptotically loss efficient estimator, $\hat{\beta}_{\hat{\lambda}_n}$. The same result holds under normality
of the error terms with (A3*) replacing (A3).

Next, we prove that any procedure that is asymptotically equivalent to $\Gamma_n(\lambda)$ is also
efficient.

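Theorem 2.3. Assuming (A1)-(A4) hold and that $d_n/n \to 0$ as $n \to \infty$, any information
criterion that can be written in the form

$$ \hat{\sigma}^2_\lambda \left( 1 + \frac{2 d_{\alpha_\lambda}}{n} + \delta_\lambda \right), $$

where

$$ \sup_{\lambda \in [0, \lambda_{\max}]} |\delta_\lambda| \to_p 0 \tag{C1} $$

and

$$ \sup_{\lambda \in [0, \lambda_{\max}]} \frac{|\delta_\lambda|}{L(\hat{\beta}_\lambda)} \to_p 0, \tag{C2} $$

is an asymptotically loss efficient procedure for selecting $\lambda$. The same result holds under
normality of the error terms with (A3*) replacing (A3).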



Condition (C2) in Theorem 2.3 is a stronger assumption than in the analogous result established
by Theorem 4.2 in Shibata (1980) for selecting the optimal order of a linear process,
but Theorem 2.3 is sufficient to show that $AIC_\lambda$, $GCV_\lambda$, and $AICc_\lambda$ are asymptotically loss
efficient model selection procedures for the regularization parameter. All three methods can
be shown to satisfy (C1) and (C2) using Taylor series expansions. The details are provided
in the supplementary material.
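The Taylor-expansion argument can be checked numerically. The sketch below is an illustration rather than a proof: it assumes the standard forms of the criteria, writes $d$ for the number of non-zero coefficients, and compares each criterion's multiplicative factor on $\hat{\sigma}^2_\lambda$ with the factor $1 + 2d/n$ that appears in $\Gamma_n(\lambda)$. For AIC and AICc the factor is obtained by exponentiating the criterion. The function name is ours.

```python
import math


def relative_penalty_gaps(d, n):
    """Gap between each criterion's factor on sigma_hat^2 and 1 + 2d/n."""
    base = 1 + 2 * d / n
    factors = {
        "AIC": math.exp(2 * d / n),                     # exp(AIC) / sigma_hat^2
        "GCV": (1 - d / n) ** -2,                       # GCV / sigma_hat^2
        "AICc": math.exp(2 * (d + 1) / (n - d - 2)),    # exp(AICc) / sigma_hat^2
    }
    return {name: value - base for name, value in factors.items()}


# With d fixed, every gap shrinks as n grows, so d/n -> 0 drives the equivalence.
for n in (100, 1000, 100000):
    print(n, relative_penalty_gaps(5, n))
```

The gaps play the role of a remainder term that vanishes as $d/n \to 0$, which is the sense in which the three criteria behave like $\Gamma_n(\lambda)$ in large samples.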
Remark 1. The efficiency proofs in this section make use of the results from Li (1987),
which operate under assumptions (A1)-(A3). Similar results exist in Shibata (1981) if the
error terms are normally distributed and (A3*) is substituted for (A3). The efficiency of
$AIC_\lambda$, $AICc_\lambda$, and $GCV_\lambda$ can be shown in a similar manner in this setting.



3 GLMs with No Dispersion Parameter 

We now generalize our efficiency results to a broader class of models by studying the asymptotic
performance of $AIC_\lambda$ as a selector of $\lambda$ when the likelihood function is misspecified as a
generalized linear model (GLM) and prove that it is asymptotically loss efficient. We assume
that the data $y_1, \ldots, y_n$ are independent with common unknown probability density function
$g(y)$ and that $E(y_i) = \mu_i$ and $\mathrm{Var}(y_i) = \sigma_i^2$. To approximate this distribution, we consider a
family of GLMs where the density of each candidate model is given by

$$ f_\alpha(y_i; \beta_\alpha) = \exp\left\{ y_i \theta_{\alpha i} - b(\theta_{\alpha i}) + c(y_i) \right\}, $$

where $\theta_\alpha = X_\alpha \beta_\alpha$, for $\alpha \in A_n$. Here we have assumed that there is no dispersion parameter,
and we further assume that $b(\theta)$ is three times differentiable and that $b''(\theta) > 0$ for all $\theta$. All
of these assumptions would hold for probit or logistic regression and the Poisson log-linear
model.

A reasonable objective in this setting is to minimize two times the average Kullback-Leibler
(KL) loss function, which is defined as

$$ L_{KL}(\beta_\alpha) = \frac{2}{n} \sum_{i=1}^{n} \left\{ E_0(\log g(y_i)) - E_0(\log f_\alpha(y_i; \beta_\alpha)) \right\} = \frac{2}{n} \sum_{i=1}^{n} \left[ \mu_i (\theta_{0i} - \theta_{\alpha i}) + \left( b(\theta_{\alpha i}) - b(\theta_{0i}) \right) \right]. $$

For a given sample size $n$, we define $\theta^*_\alpha = X_\alpha \beta^*_\alpha$ as the minimizer of the KL loss. By Theorem
1 in Lv and Liu (2010) we have that $\theta^*_\alpha$ is the unique solution to the equation

$$ X_\alpha' \left( \mu - b'(\theta) \right) = 0. \tag{3.1} $$

If $g(y) = f_\alpha(y; \beta_0)$ for some true parameter $\beta_0$ for any $\alpha$, then $\beta^*_\alpha = \beta_0$. However, if we
assume that we are in the non-true model world, then $g(y)$ is not completely specified by
any of the candidate models and we refer to $\beta^*_\alpha$ as the "pseudo-true parameter" based on
the candidate model $\alpha$.

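Similarly to the Gaussian model, for a given $\lambda$, we take $\hat{\theta}_\lambda = X \hat{\beta}_\lambda$ and denote the
maximum-likelihood estimator based on the model $\alpha_\lambda$ by $\hat{\theta}_{\alpha_\lambda} = X_{\alpha_\lambda} \hat{\beta}_{\alpha_\lambda}$. If we let $\hat{\lambda}_n$ denote
the regularization parameter selected by a given selection procedure, then the procedure is
defined to be asymptotically loss efficient if

$$ \frac{L_{KL}(\hat{\beta}_{\hat{\lambda}_n})}{\inf_{\lambda \in [0, \lambda_{\max}]} L_{KL}(\hat{\beta}_\lambda)} \to_p 1, $$

and $\hat{\beta}_{\hat{\lambda}_n}$ is said to be an asymptotically loss efficient estimator.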

ZLT studied the asymptotic performance of $AIC_\lambda$ in a similar setting. To establish
asymptotic loss efficiency, ZLT restricted the set of candidate models to the set

$$ V = \{ \alpha : \sup |\hat{\theta}_\alpha - \theta_0| \to 0 \text{ in probability, as } n \to \infty \}, $$

where $\theta_0 = X \beta_0$. For this restricted set of models, the maximum-likelihood estimator converges
uniformly to the true parameter. If this set is known in practice, then the model
selection process reduces to selecting the most parsimonious model in this set. This class of
models would rarely be known in practice, so this motivates us to weaken this assumption
and to prove the efficiency of $AIC_\lambda$ for a general set of candidate models.



Under the regularity conditions (R1)-(R2) given in the supplementary material, White
(1982) proved that $\hat{\beta}_\alpha - \beta^*_\alpha \to 0$, almost surely, and established the asymptotic normality
of $\hat{\beta}_\alpha - \beta^*_\alpha$ under (R1)-(R4). With the additional condition (R5), Nishii (1988) applied a
Taylor expansion to show that



(3.2) 



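for $n$ sufficiently large, where $A_n = -\frac{1}{n}\,\partial^2 l(\beta^*_\alpha) / \partial\beta\,\partial\beta^T$ and $v_j = O_p(\|\hat{\beta}_\alpha - \beta^*_\alpha\|)$ for
$j = 1, \ldots, d_\alpha$.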

We define the risk function of the maximum-likelihood estimator to be
$R_{KL}(\hat{\beta}_\alpha) = E_0(L_{KL}(\hat{\beta}_\alpha))$. From Theorem 4 of Lv and Liu (2010), under (R1)-(R6),



n 



where $W_0 = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_n^2\}$ and $W_\alpha = \mathrm{diag}\{b''(\theta_{\alpha 1}), \ldots, b''(\theta_{\alpha n})\}$. Similarly to the
Gaussian model, we further define the random variable



+ 0(1). 



n 



With these results and the following assumptions, we can prove the efficiency of $AIC_\lambda$.

(A1') $(\frac{1}{n} X'X)^{-1}$ exists and its minimum and maximum eigenvalues are bounded below and
above by constant numbers $C_1$ and $C_2$, respectively.

(A2') $E(y_i - \mu_i)^{4q} < \infty$, for $i = 1, \ldots, n$ and some positive integer $q$.

(A3') The risks of the maximum-likelihood estimators satisfy

$$ \sum_{\alpha \in A_n} \left( n R_{KL}(\hat{\beta}_\alpha) \right)^{-q} \to 0. $$


(A4') $\sup_\theta b''(\theta) < \infty$.

(A5') $\sqrt{n}\,\lambda_{\max} < M_1$ for all $n$ for some constant $M_1 > 0$.

(A6') For any $\theta$, $p'_\lambda(\theta) < M_2 \lambda$ for some constant $M_2 > 0$.

(A7') $n L_{KL}(\beta^*_{\bar{\alpha}}) / d_n \to \infty$ as $n \to \infty$.

The first three assumptions are analogous to the assumptions made in the Gaussian model,
and assumption (A4') is a mild regularity assumption. As shown by the following lemma, 
assumptions (A5')-(A7') are sufficient conditions for the penalized estimator to be close to the 
maximum-likelihood estimator. These assumptions are analogous to the sufficient conditions 
used in the Gaussian model. They are stated explicitly here since they are required in parts 
of the efficiency proof. 

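Lemma 3.1. Under (A5')-(A7'),

$$ \sup_{\lambda \in [0, \lambda_{\max}]} \frac{\|b_\lambda\|^2}{R_{KL}(\hat{\beta}_{\alpha_\lambda})} \to 0, $$

where $b_\lambda$ is a $d_n \times 1$ vector where $b_{\lambda i} = p'_\lambda(|\hat{\beta}_{\lambda i}|)\,\mathrm{sgn}(\hat{\beta}_{\lambda i})$ for all $i$ such that $|\hat{\beta}_{\lambda i}| > 0$ and is
equal to 0 otherwise.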

The proof is given in Appendix B. The next theorem establishes the efficiency of $AIC_\lambda$.

Theorem 3.1. Assuming $d_n/n \to 0$ as $n \to \infty$, (A1')-(A7') and the regularity conditions
(R1)-(R6), the regularization parameter, $\hat{\lambda}_n$, selected by minimizing $AIC_\lambda$ yields an
asymptotically loss efficient estimator, $\hat{\beta}_{\hat{\lambda}_n}$.

The proof is given in Appendix B.

4 Simulation Studies 

In this section we study the finite sample performance of the model selection procedures 
when the true model is not included in the set of candidate models. 

In all of the examples, the results are based on 1000 realizations of samples with $n = 100$,
200, and 400, and the selection procedures are evaluated based on their loss efficiency,
loss, and the variability of the selected number of non-zero coefficients. For each realization,
if we let $\hat{\lambda}_n$ denote the regularization parameter selected by a given selection procedure, then
the loss efficiency is computed as

$$ \frac{\min_{\lambda \in [0, \lambda_{\max}]} L(\hat{\beta}_\lambda)}{L(\hat{\beta}_{\hat{\lambda}_n})}, $$

where $L(\cdot)$ is the $L_2$ loss in the linear regression examples and is the KL loss in the GLM
examples. For comparison, we also include results for the (infeasible) "Optimal" procedure,
which selects the tuning parameter over the regularization path that produces the minimum
loss for each realization, and we report the loss ("Min.Loss") achieved by this procedure.
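The loss efficiency of a single realization can be computed directly once the true mean vector and the fitted means along the regularization path are available, as they are in a simulation. The sketch below is a minimal illustration for the $L_2$ loss; the function names and the toy numbers are ours, and it assumes the efficiency is the ratio of the minimum loss on the path to the loss at the selected $\lambda$, so that values lie in $[0, 1]$.

```python
def l2_loss(mu, mu_hat):
    """Average squared distance between the true and the fitted mean vector."""
    if len(mu) != len(mu_hat):
        raise ValueError("mu and mu_hat must have the same length")
    return sum((m - f) ** 2 for m, f in zip(mu, mu_hat)) / len(mu)


def loss_efficiency(mu, path_fits, selected):
    """Minimum loss over the path divided by the loss of the selected fit.

    mu        : true mean vector (known only in a simulation)
    path_fits : fitted mean vectors, one per lambda on the grid
    selected  : index of the lambda chosen by the selector under study
    """
    losses = [l2_loss(mu, fit) for fit in path_fits]
    chosen = losses[selected]
    if chosen == 0.0:
        return 1.0  # an exact fit is necessarily the best fit on the path
    return min(losses) / chosen


# Toy example with three fits on the path.
mu = [0.0, 0.0]
path_fits = [[1.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
print(loss_efficiency(mu, path_fits, selected=0))
```

The "Optimal" procedure corresponds to always passing the index of the smallest loss, which gives an efficiency of one by construction.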



4.1 Linear Regression 

In this section we study the finite sample performance of the model selection procedures
discussed in Section 2.2. The first set of simulations considers a trigonometric regression
where the candidate models are in the neighborhood of the true model but never include the
true model. This example is in line with the framework considered by Shibata (1980) and
Hurvich and Tsai (1991). The second set of simulations looks at an example where there is
an omitted predictor. For example, the researcher may have access to some of the relevant
predictors but may be missing others. This is the setting that was considered by ZLT.



4.1.1 Choice of Penalty Function 

We consider two common choices for the penalty function. The first is the Smoothly Clipped
Absolute Deviation (SCAD) penalty function proposed by Fan and Li (2001). This penalty
function is defined by

$$ p'_\lambda(\beta) = \lambda \left\{ I(\beta \le \lambda) + \frac{(a\lambda - \beta)_+}{(a - 1)\lambda}\, I(\beta > \lambda) \right\} $$

for some $a > 2$ and $\beta > 0$. Fan and Li (2001) recommended setting the second tuning
parameter in the SCAD penalty function, $a$, equal to 3.7 and this is commonly done in
practice; however, doing so will not necessarily guarantee that the SCAD objective function is
convex and can result in convergence to local, but non-global, minima. As a result, in addition
to studying the performance of SCAD with $a = 3.7$ (SCAD, 3.7), we study the performance of
SCAD where $a = \max(3.7, 1 + 1/c^*)$ (SCAD), where $c^*$ is the minimum eigenvalue of $n^{-1} X'X$.
The latter choice will force the objective function to be convex (Breheny and Huang, 2011).
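For readers who want to experiment with the penalty, the SCAD derivative of Fan and Li (2001) and the convexity-preserving choice of $a$ can be written in a few lines. This is a sketch under stated assumptions: the function names are ours, the derivative is evaluated at $|\beta|$, and `c_star` stands for the minimum eigenvalue of $n^{-1} X'X$, which the caller is assumed to have computed separately.

```python
def scad_derivative(beta, lam, a=3.7):
    """SCAD penalty derivative p'_lambda(|beta|).

    Equals lam on [0, lam], decays linearly on (lam, a * lam),
    and is zero beyond a * lam, so large coefficients are not shrunk.
    """
    if a <= 2:
        raise ValueError("SCAD requires a > 2")
    if lam < 0:
        raise ValueError("lambda must be non-negative")
    t = abs(beta)
    if t <= lam:
        return lam
    # (a*lam - t)_+ / ((a - 1) * lam) * lam simplifies to (a*lam - t)_+ / (a - 1)
    return max(a * lam - t, 0.0) / (a - 1)


def convex_scad_a(c_star, default=3.7):
    """Return a = max(3.7, 1 + 1/c_star), the choice described in the text."""
    if c_star <= 0:
        raise ValueError("c_star must be positive")
    return max(default, 1.0 + 1.0 / c_star)


for beta in (0.5, 2.0, 4.0):
    print(beta, scad_derivative(beta, lam=1.0))
print(convex_scad_a(0.5), convex_scad_a(0.1))
```

When `c_star` is at least $1/2.7 \approx 0.37$ the default $a = 3.7$ is already large enough, so the two SCAD variants coincide.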



The wide use of SCAD is mainly due to the fact that it satisfies the "oracle property."
This means that, assuming that the true model is in the set of candidate models and subject
to certain regularity assumptions, there exists a sequence $\{\lambda_n\}$ such that if $\lambda_n \to 0$ and
$\sqrt{n}\,\lambda_n \to \infty$ then with probability tending to one the SCAD-estimated regression based
on the full model will correctly zero out any zero coefficients and have the same asymptotic
distribution as the least squares regression based on the correct model. This result was proven
originally for $d_n$ fixed by Fan and Li (2001) and was extended to the case where $d_n < n$ but
$d_n \to \infty$ by Fan and Peng (2004). These results are for an unknown deterministic sequence
that needs to be estimated in practice.



The second penalty function that we study is the Lasso proposed by Tibshirani (1996).
The Lasso penalty is the $L_1$-norm of the coefficients. Necessary and sufficient conditions have
been established for the Lasso to perform consistent model selection (Zhao and Yu, 2006),
but in general the Lasso produces biased estimates and does not satisfy the oracle property
(Zou, 2006). However, in the non-true model world, the oracle property has no meaning,
since there is no true model. Further, even in the true model world, the oracle property is
an asymptotic property.

It is important to note that although ZLT only studied non-concave penalty functions, if
the non-zero estimated coefficients, $\hat{\beta}_{\lambda i}$, satisfy a relationship of the form

$$ \frac{1}{n} \left[ X'(y - X\hat{\beta}_\lambda) \right]_i = b_i $$

with probability tending to 1 and (A4) is satisfied, then the efficiency proofs will hold for any
penalty function. In the above, $b_i$ are the elements of $b$ that correspond to $\hat{\beta}_{\lambda i}$. In particular,
based on Lemma 2 of Zou et al. (2007), the Lasso satisfies this relationship and the same
sufficient conditions provided by ZLT for (A4) can be used. Therefore, the efficiency proofs
will hold for the Lasso, so it is interesting to compare the performance of the two penalty
functions.



The Lasso regressions are fit using the R lars package (Hastie and Efron, 2011) and the
SCAD regressions are fit using the R ncvreg package (Breheny and Huang, 2011). The lars
package computes the entire regularization path for the Lasso, and for SCAD the models
are fit over a grid of 200 $\lambda$ values from $\lambda_{\min}$ to $\lambda_{\max}$, where the first 100 values of $\lambda$ are fit
on a log-scale and the last 100 values of $\lambda$ are equally spaced. Breheny and Huang (2011)
considered a grid of 100 $\lambda$ values in their simulation studies. We have chosen a grid that is
twice as fine in order to remain closer to the theoretical assumption that all possible values of
$\lambda$ are considered. In all simulations, $\lambda_{\max}$ is specified so that all of the estimated coefficients
are zero and $\lambda_{\min}$ is chosen to effectively produce the least squares estimate on the full model.
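
The grid of 200 tuning-parameter values described above can be sketched as follows. This is only an illustration and not the ncvreg implementation: the function name `lambda_grid`, the descending ordering from the largest value, and the breakpoint `lam_break` at which the log-spaced piece hands over to the equally spaced piece are all assumptions, since the text does not specify them.

```python
import numpy as np


def lambda_grid(lam_min, lam_max, n_log=100, n_lin=100, lam_break=None):
    """Descending grid: n_log log-spaced values, then n_lin equally spaced values.

    lam_break (where the two pieces meet) is a hypothetical parameter; by
    default it is set to the geometric midpoint of lam_min and lam_max.
    """
    if not (0 < lam_min < lam_max):
        raise ValueError("require 0 < lam_min < lam_max")
    if lam_break is None:
        lam_break = np.sqrt(lam_min * lam_max)  # assumption, not from the paper
    # First piece: log-spaced from lam_max down to the breakpoint (inclusive).
    log_part = np.geomspace(lam_max, lam_break, num=n_log)
    # Second piece: equally spaced from just below the breakpoint down to lam_min.
    lin_part = np.linspace(lam_break, lam_min, num=n_lin + 1)[1:]
    return np.concatenate([log_part, lin_part])


grid = lambda_grid(lam_min=0.001, lam_max=2.0)
print(len(grid), grid[0], grid[-1])
```

In practice the largest value would be the smallest penalty level that sets every coefficient to zero, and the smallest value would be close enough to zero to reproduce the full least squares fit, as stated in the text.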



4.1.2 Exponential model 



Here we consider a trigonometric example based on an example studied in Hurvich and Tsai
(1991). The true model is the model described as



for $i = 1, \dots, n$, where $\varepsilon_i \sim N(0, \sigma^2)$. The estimated models are SCAD and Lasso penalized
regressions where the matrix of predictors, $X = (x^1, x^2)$, is an $n \times d_n$ matrix with components
defined by

$$x^1_{ij} = \sin\left(\frac{2\pi j}{n}\, i\right)$$

and

$$x^2_{ij} = \cos\left(\frac{2\pi j}{n}\, i\right)$$

for $j = 1, \dots, d_n/2$ and $i = 1, \dots, n$. The maximum number of predictors is allowed to vary
by letting the dimension $d_n = 2\lfloor n^c/2 \rfloor$. It is shown in the appendix that for this example
$\|\mu - H_\alpha \mu\|^2 > k_1 n d_n^{-2}$ for some positive constant $k_1$. Therefore, by Lemma 2.1, assumption
(A3') will hold so long as $c < 1/3$. In the simulations we take $c = .3$, and for comparison we
also consider $c = .5$, $.8$ and $.98$. Note that examining $d_n$ close to $n$ allows for the study of
high-dimensional data problems, and is in the spirit of simulations performed in Tibshirani (1996)
and Zou and Hastie (2005). Since the predictor variables are orthogonal in this example,
setting $a = 3.7$ for SCAD satisfies the convexity constraint for all values of $c$.
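
The design matrix of this example can be sketched in a few lines. The code below is a minimal illustration assuming sine and cosine columns at frequencies $j = 1, \dots, d_n/2$ and a dimension of twice the integer part of $n^c/2$, as in the text; the function name `trig_design` is hypothetical. It also checks numerically that the columns are mutually orthogonal, which is the property the text relies on.

```python
import numpy as np


def trig_design(n, c):
    """n x d_n matrix of sine and cosine predictors, with d_n = 2 * floor(n**c / 2)."""
    d_n = 2 * int(np.floor(n ** c / 2))
    i = np.arange(1, n + 1)[:, None]          # observation index, shape (n, 1)
    j = np.arange(1, d_n // 2 + 1)[None, :]   # frequency index, shape (1, d_n / 2)
    angle = 2.0 * np.pi * j * i / n
    # First d_n / 2 columns are sines, last d_n / 2 columns are cosines.
    return np.hstack([np.sin(angle), np.cos(angle)])


X = trig_design(n=200, c=0.98)
gram = X.T @ X
# With all frequencies below n / 2, the Gram matrix is (n / 2) times the identity.
print(X.shape, np.allclose(gram, (200 / 2) * np.eye(X.shape[1])))
```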

As in Hurvich and Tsai (1991), we examine both $\sigma^2 = 50$ and $\sigma^2 = 100$, but the patterns
for the two error variances are similar so only the results for $\sigma^2 = 100$ are reported. The
median $L_2$ loss efficiency is presented in Table 1 for both SCAD and Lasso. For all values
of $c$, the median loss efficiencies of $\mathrm{AIC}_{c_\lambda}$ and $C_{p_\lambda}$ tend to one as the sample size increases,
while the median loss efficiency of $\mathrm{BIC}_\lambda$ does not show signs of convergence. These patterns
are consistent with the theoretical efficiency results. When the number of predictor variables
is small relative to the sample size, the loss efficiency of $\mathrm{AIC}_\lambda$ also tends to one; however,
as the number of candidate predictors is increased, the performance of $\mathrm{AIC}_\lambda$ deteriorates.
Figure 1 displays boxplots of the selected number of non-zero coefficients when $n = 200$,
$\sigma^2 = 100$, and $c = .98$. From this plot we see that $\mathrm{AIC}_\lambda$ often selects a model that is close
to the full model when $c$ is large. As the sample size is increased the full model becomes less
desirable and $\mathrm{AIC}_\lambda$ suffers as a result. For SCAD, $\mathrm{GCV}_\lambda$ appears to suffer from a similar
problem, but to a lesser extent than $\mathrm{AIC}_\lambda$. The difference in performance for varying values
of $c$ suggests that the good asymptotic performance of $\mathrm{AIC}_\lambda$ and $\mathrm{GCV}_\lambda$ is strongly dependent
on the fact that $d_n/n \to 0$ and these selectors may not perform well in finite samples when
this ratio is close to 1.

Overall, the sensitivity to the value of $c$ clearly hurts the performance of $\mathrm{AIC}_\lambda$ and can
also negatively impact the performance of $C_{p_\lambda}$ and $\mathrm{GCV}_\lambda$. The impact on the latter two is
more noticeable when looking at SCAD, but in both cases the extreme variability in the size
of the selected model is undesirable. As a result, we recommend the use of AICc or 10-fold
CV, which are less sensitive to the closeness of $d_n$ to $n$.
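
The loss efficiencies discussed above can be computed along the following lines. The sketch assumes that the efficiency of a selector in one simulation run is the loss at the selected tuning-parameter value divided by the smallest loss over the grid, which is consistent with the reported values being at least one; the function names and the array layout (one row per simulation, one column per grid value) are illustrative assumptions.

```python
import numpy as np


def loss_efficiency(losses, selected):
    """Ratio of the loss at the selected grid index to the minimum loss, per run.

    losses   : array of shape (n_sims, n_lambda), loss of each candidate fit
    selected : array of shape (n_sims,), grid index chosen by the selector
    """
    losses = np.asarray(losses, dtype=float)
    selected = np.asarray(selected, dtype=int)
    chosen = losses[np.arange(losses.shape[0]), selected]
    return chosen / losses.min(axis=1)


def median_loss_efficiency(losses, selected):
    """Median over simulation runs of the per-run loss efficiency."""
    return float(np.median(loss_efficiency(losses, selected)))


# Toy example with three runs and four candidate values of the tuning parameter.
toy_losses = [[4.0, 2.0, 1.0, 3.0],
              [2.0, 1.0, 2.0, 4.0],
              [3.0, 3.0, 1.5, 6.0]]
toy_selected = [1, 1, 0]
toy_median = median_loss_efficiency(toy_losses, toy_selected)
print(toy_median)
```

A value of one in a given run means the selector picked the best available fit on the grid; larger values indicate how much loss was incurred relative to that best fit.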

Table 1: Median L2 Loss Efficiency over 1000 simulations for the exponential model with
$\sigma^2 = 100$.

                    Median Loss Efficiency
                    SCAD                            Lasso
Info. Crit.   n     c=.3    c=.5    c=.8    c=.98   c=.3    c=.5    c=.8    c=.98
10-fold CV    100   1.00    1.05    1.07    1.08    1.00    1.01    1.05    1.12
              200   1.00    1.03    1.06    1.05    1.00    1.01    1.03    1.07
              400   1.00    1.03    1.03    1.04    1.00    1.01    1.02    1.04
AIC_λ         100   1.00    1.04    1.18    2.43    1.00    1.01    1.07    2.13
              200   1.01    1.02    1.20    3.08    1.00    1.01    1.06    --
              400   --      --      --      4.05    --      --      --      --
AICc_λ        100   1.00    --      --      --      --      --      --      --
              200   --      --      --      --      1.00    1.01    1.06    1.11
              400   1.00    1.02    1.05    1.06    1.00    1.01    1.04    1.08
BIC_λ         100   1.00    1.07    1.32    1.64    1.00    1.05    1.60    1.64
              200   1.02    1.06    1.47    1.51    1.00    1.06    1.74    1.62
              400   1.01    1.07    1.60    1.51    1.00    1.08    1.80    1.60
Cp_λ          100   1.00    1.04    1.10    1.22    1.00    1.01    --      --
              200   --      --      1.09    1.15    1.00    1.01    1.03    1.09
              400   1.00    1.02    1.08    1.09    1.00    1.01    1.03    1.05
GCV_λ         100   1.00    1.04    1.10    1.69    1.00    1.01    1.06    1.16
              200   1.01    1.02    1.10    1.73    1.00    1.01    1.04    1.09
              400   1.00    1.02    1.08    1.82    1.00    1.01    1.03    1.05

Cells shown as -- are missing from the source. Surviving values that cannot be assigned to
a cell, in source order: 1.01 (AIC_λ, n = 400); 1.00 (AICc_λ, n = 100 or 200); 1.15 and 1.01
(Cp_λ, n = 100 or 200).

Figure 1: Comparison of model selection procedures based on the number of non-zero
coefficients (includes intercept) in the selected model over 1000 simulations for the exponential
model with $n = 200$, $\sigma^2 = 100$, and $c = 0.98$. Panels: (a) SCAD; (b) Lasso.



Figure 2 presents boxplots of the $L_2$ loss for the 1000 realizations when $n = 200$ for
$c = .5$ and $c = .98$. From this we can compare the optimal performance of SCAD and the
Lasso. Based on minimum loss, the predictive accuracies of the two methods are similar.
This reinforces that the existence of an oracle property is not relevant in the non-true model
world, and an estimator that does not possess the oracle property can still be effective from
a predictive point of view.

Figure 2: Comparison of model selection procedures based on L2 Loss over 1000 simulations
for the exponential model with $n = 200$ and $\sigma^2 = 100$. Panels: (a) c=.5; (b) c=.98.



4.1.3 Omitted Predictor 

Here we study an omitted predictor example similar to Example 2 in ZLT. The true model is
defined as

$$y_i = 3x_{i,1} + 1.5x_{i,2} + 2x_{i,10} + x_{i,13} + \varepsilon_i$$

where $\varepsilon_i \sim N(0, \sigma^2)$ for $\sigma^2 = 16$ and $\sigma^2 = 25$. We let $X$ be a $2n \times (d_n + 1)$ matrix of
predictors where the $x_i$'s are simulated from a multivariate normal distribution with mean 0
and variance-covariance matrix $\Sigma$ where $\Sigma_{ij} = \rho^{|i-j|}$ for $\rho = 0$ and $0.5$. In the simulations
$X$ is simulated once and is used for every simulation run in order to resemble a fixed $X$
setting. The estimated models are SCAD and Lasso penalized regressions based on the first
$n$ observations of $X$ except with the 13th column removed so that the true model is never
included in the set of candidate models. In order to compare predictive performance, we
treat the remaining observations of $X$ as a hold-out sample and use it to compute the loss
for each estimated model.
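
The data-generating step of this example can be sketched as below. The sketch is illustrative only: the function name, the seed handling, and in particular the covariance with entries equal to the correlation parameter raised to the absolute index difference are assumptions, and for brevity the predictors and one response vector are drawn together, whereas in the study the predictor matrix is drawn once and reused across runs. The sketch requires at least 13 columns so that the omitted predictor exists.

```python
import numpy as np


def simulate_omitted_predictor(n, c, sigma2=16.0, rho=0.5, seed=0):
    """Return (X_train, y_train, X_holdout, mu_holdout) with the 13th column removed."""
    rng = np.random.default_rng(seed)
    d_n = 2 * int(np.floor(n ** c / 2))
    p = d_n + 1
    if p < 13:
        raise ValueError("need at least 13 columns to include the omitted predictor")
    idx = np.arange(p)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])  # assumed covariance structure
    X = rng.multivariate_normal(np.zeros(p), cov, size=2 * n)
    beta = np.zeros(p)
    beta[[0, 1, 9, 12]] = [3.0, 1.5, 2.0, 1.0]        # columns 1, 2, 10 and 13
    mu = X @ beta
    y_train = mu[:n] + rng.normal(0.0, np.sqrt(sigma2), size=n)
    X_cand = np.delete(X, 12, axis=1)                  # drop the 13th column
    # First n rows are used for fitting, the remaining n rows as a hold-out sample.
    return X_cand[:n], y_train, X_cand[n:], mu[n:]


X_train, y_train, X_hold, mu_hold = simulate_omitted_predictor(n=200, c=0.8)
print(X_train.shape, X_hold.shape)
```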

In both examples the number of superfluous variables included in the candidate models is
allowed to vary by letting the dimension $d_n = 2\lfloor n^c/2 \rfloor$. Under deterministic $X$, it is shown
in the supplementary material that $\|\mu - H_\alpha \mu\|^2 > k_1 n$ for some positive constant $k_1$ if the
excluded predictor is orthogonal to the included predictors. By Lemma 2.1, assumption (A3')
will then hold if $d_n/n \to 0$. This suggests that when the excluded predictor is uncorrelated
or only moderately correlated with the included predictors it is reasonable to compare $c =
0.5$, $0.8$ and $0.98$.

In this example setting $a = 3.7$ will not satisfy the convexity constraint for all values of
$c$. Therefore, we further compare the case where $a = 3.7$ (SCAD, $a = 3.7$) to the case where
$a = \max(3.7, 1 + 1/c_*)$ (SCAD).
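
A data-dependent choice of the second SCAD tuning parameter along these lines can be sketched as follows. The sketch assumes that the constant appearing in the convexity constraint is the smallest eigenvalue of the scaled Gram matrix of the predictors (the cross-product matrix divided by the sample size); that definition is not restated in this section, so it is labeled here as an assumption, and the function name `scad_a` is hypothetical.

```python
import numpy as np


def scad_a(X, a_default=3.7):
    """Return (a, c_star) with a = max(a_default, 1 + 1 / c_star).

    c_star is taken to be the minimum eigenvalue of X'X / n (an assumption).
    """
    n = X.shape[0]
    c_star = float(np.linalg.eigvalsh(X.T @ X / n).min())
    if c_star <= 0:
        raise ValueError("X'X / n is singular; the bound 1 + 1 / c_star is undefined")
    return max(a_default, 1.0 + 1.0 / c_star), c_star


# Toy design whose scaled Gram matrix is 0.25 times the identity.
n_demo = 8
X_demo = 0.5 * np.sqrt(n_demo) * np.eye(n_demo)[:, :2]
a_demo, c_demo = scad_a(X_demo)
print(a_demo, c_demo)
```

When the smallest eigenvalue is large enough that the bound falls below 3.7, the default value is returned unchanged, which matches the orthogonal design of the exponential model example.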

The patterns for the two error variances and two values of $\rho$ are similar so only the
results for $\sigma^2 = 16$ and $\rho = 0.5$ are reported. We first consider Figure 3, which presents
boxplots comparing the three estimators based on loss when $n = 200$. From these plots it
is immediately clear that all of the information criteria perform better when a is allowed 
to be data-dependent, while 10-fold CV performs well regardless of the choice of a. One 
possible explanation for this is that all of the information criteria under consideration were 
derived for use in classical least squares regression so they should perform well assuming 
that the estimated models are close to the corresponding OLS models. When the second 
tuning parameter of SCAD is fixed at 3.7, the objective function is not necessarily convex 
so the SCAD-estimated models may be very far from the OLS models. On the other hand, 
10-fold CV is a general model selection procedure that should work in a variety of settings. In 
general, we recommend using a data-dependent choice of a since it requires little additional 
cost and can greatly improve the performance of all of the information criteria. 

Focusing only on the data-dependent choice of $a$, we see that the performance of the model
selection procedures is similar for both SCAD and Lasso when $c = .8$ and when $c = .98$, but
that the performance of SCAD is noticeably worse when $c = .5$. A possible explanation for
this is that when $c$ is small, the performance of the SCAD estimators is more sensitive to the
choice of the second tuning parameter. Although taking $a = \max(3.7, 1 + 1/c_*)$ guarantees
that the penalized loss function is convex, it may not be the optimal choice for this parameter
and more investigation into the choice of this parameter is needed. Of course, this implies an
advantage of Lasso over SCAD, since it does not require the choice of this second parameter.

Comparing the model selection procedures, we again see that $\mathrm{AIC}_\lambda$, $\mathrm{GCV}_\lambda$, and $C_{p_\lambda}$ are
sensitive to the number of predictor variables while $\mathrm{AIC}_{c_\lambda}$ and 10-fold CV maintain good
performance. The boxplots of the selected number of non-zero coefficients are omitted since
the patterns are similar to those seen in the exponential model. In Figure 3 it is clear that this
sensitivity to the value of $c$ impacts the performance of the model selection procedures, and
as a result 10-fold CV and $\mathrm{AIC}_{c_\lambda}$ outperform the other procedures. 10-fold CV outperforms
$\mathrm{AIC}_{c_\lambda}$ in some scenarios, but, in general, the performance of the two methods appears to be
comparable.

Figure 3: Comparison of model selection procedures based on L2 Loss on new design points
over 1000 simulations for the model with an omitted predictor with $n = 200$ and $\rho = 0.5$. In
order to make it easier to compare the procedures, the limits of the vertical axis are specified
so that all the boxes and whiskers appear but some of the outliers are not shown.

In order to study the asymptotic behavior of the selection procedures, Table 2 presents
the median loss efficiencies. With the exception of SCAD with $c = 0.5$, the loss efficiencies
of $\mathrm{AIC}_{c_\lambda}$, $C_{p_\lambda}$, and $\mathrm{GCV}_\lambda$ tend to one, while the loss efficiency of $\mathrm{BIC}_\lambda$ does not show signs
of convergence. Also, the results again show that $\mathrm{AIC}_\lambda$ performs poorly when the number
of predictor variables is large relative to the sample size. For SCAD with $c = 0.5$, the loss
efficiencies of the efficient methods do not show signs of converging to one, which further
suggests that the second tuning parameter may not be optimally selected. Overall, the
results corroborate the theoretical findings, but reinforce that the finite sample performance
of asymptotically equivalent methods may vary greatly.

Table 2: Median L2 Loss Efficiency on new design points over 1000 simulations for the model
with an omitted predictor with $\rho = 0.5$.

                    SCAD                    SCAD, a = 3.7           Lasso
Info. Crit.   n     c=.5    c=.8    c=.98   c=.5    c=.8    c=.98   c=.5    c=.8    c=.98
10-fold CV    100   1.32    1.10    1.09    1.72    1.23    1.20    1.08    1.09    1.08
              200   1.19    1.07    1.07    1.35    1.08    1.10    1.05    1.06    1.05
              400   1.14    1.05    1.05    1.26    1.02    1.04    1.04    1.04    1.04
AIC_λ         100   1.49    1.44    41.44   2.78    2.57    37.24   1.08    1.19    37.64
              200   1.57    1.24    51.80   3.03    3.30    59.94   1.06    1.12    49.77
              400   1.84    1.11    67.73   4.13    3.16    76.07   1.04    1.07    64.94
AICc_λ        100   1.36    1.13    1.10    2.19    1.45    4.27    1.07    1.09    1.08
              200   1.41    1.08    1.07    2.45    1.27    10.10   1.06    1.07    1.06
              400   1.68    1.06    1.05    3.31    1.10    17.28   1.04    1.04    1.05
BIC_λ         100   1.12    1.26    1.40    1.26    1.41    1.62    1.11    1.24    1.31
              200   1.07    1.39    1.40    1.13    1.31    1.38    1.21    1.34    1.33
              400   --      --      --      --      --      --      --      --      --
Cp_λ          100   --      --      --      --      --      --      --      --      --
              200   --      --      --      --      --      --      --      --      1.14
              400   --      --      --      --      --      --      --      --      --
GCV_λ         100   1.43    1.20    1.13    2.49    2.01    14.24   1.07    1.11    1.12
              200   1.48    1.11    1.10    2.75    2.10    --      1.06    1.08    1.09
              400   --      1.07    1.06    3.86    1.26    35.72   1.04    1.05    1.06

Cells shown as -- are missing from the source. Surviving values that cannot be assigned to
a cell, in source order: for the BIC_λ (n = 400) and Cp_λ (n = 100, 200) rows, 1.05, 1.31, 1.32,
1.16, 1.42, 1.22, 2.40, 1.65, 3.04, 1.08, 1.24, 1.46, 1.10, 2.69, 1.42, 5.27, 1.06; for the
Cp_λ (n = 400) row, 1.07, 3.75, 1.14, 1.04, 1.05, 1.08.


4.2 Poisson Regression 

In this section we present simulation results for GLMs with no dispersion parameter. For
GLMs, it is less clear how to handle the second tuning parameter for SCAD. Breheny and
Huang (2011) recommended using an adaptive rescaling technique, but it is unclear how
such a procedure will impact the performance of the model selection procedures and initial
simulations for Bernoulli data resulted in convergence issues. As a result we only study the
Lasso in this section. The lars package is only designed for linear regression, so we instead
work with the R glmpath package (Park and Hastie, 2011), which fits the entire regularization
path for the Lasso for GLMs.



We consider a trigonometric example based on an example studied in Hurvich and Tsai
(1991). We take 9t = e~^*/" for t = 0, . . . , — 1 and simulate yt from a Poisson distribution 
with $\mu_t = \exp(\theta_t)$. The estimated models are Lasso penalized Poisson regressions where the
matrix of predictors, $X = (x^1, x^2)$, is an $n \times d_n$ matrix with components defined as in the
exponential model. Similar to before, we vary the maximum number of predictors by letting
the dimension $d_n = 2\lfloor n^c/2 \rfloor$ and we compare $c = .3$, $c = .5$ and $c = .8$. The case with $c = .98$
is omitted due to convergence problems with the package.

Although AICc was originally derived for linear regression, its use is commonly
recommended in a more general setting when the number of predictor variables is large relative to
the sample size (Burnham and Anderson, 2002, p. 66). We therefore compare the
performance of $\mathrm{AIC}_\lambda$ to 10-fold CV, $\mathrm{AIC}_{c_\lambda}$ and $\mathrm{BIC}_\lambda$ where



n n — dfx — 2 

and 

n n 



Table 3 reports the median KL loss efficiencies over the 1000 simulations. In all three
cases, $\mathrm{AIC}_\lambda$, $\mathrm{AIC}_{c_\lambda}$, and 10-fold CV show signs of converging to one and have comparable
performance, whereas $\mathrm{BIC}_\lambda$ performs noticeably worse and does not show signs of
convergence. Figure 4 presents boxplots of the selected number of non-zero coefficients. This figure
suggests that the poor performance of $\mathrm{BIC}_\lambda$ is due to its tendency to select models that are
too sparse. In comparison, the other procedures select models with dimension closer to the
optimal dimension. Overall, these results are consistent with the theoretical findings.

Table 3: Median KL Loss Efficiency over 1000 simulations for the Poisson model.

Figure 4: Comparison of model selection procedures based on the number of non-zero
coefficients (includes intercept) in the selected model over 1000 simulations for the Poisson model
with $n = 200$

5 Analysis of a Real Data Set



We now consider the transaction data set from Sela and Simonoff (2012) in order to compare
the candidate models chosen by the regularization parameter selectors when applied to a
real world data set. The data contains transactions for third-party sellers on Amazon Web 
Services and the goal is to predict the prices at which software titles are sold based on the 
characteristics of the competing sellers. The target variable is the price premium that a seller 
can command (the difference between the price at which the good is sold and the average price 
of all of the competing goods in the marketplace). There are 24 potential predictors which 
include the seller's reputation (the total number of comments and the number of positive and 
negative comments received from buyers over different time periods), the length of time that 
the seller has been in the marketplace, the number of competitors, the quality of competing 
goods in the marketplace, the average reputation of the competitors, and the average prices 
of the competing goods. The data set contains 100 observations. 

Table 4 reports the results for the information criteria as well as 10-fold CV based on
two different runs (and hence two different random divisions of the data), which are referred
to as 10-fold CV (1) and 10-fold CV (2). Only six predictor variables were ever selected so
the remaining variables are omitted from the table. It is clear that the variables selected
are heavily reliant on the selection procedure and the penalty function chosen. In particular,
there is a noticeable difference between the variables selected by $\mathrm{AIC}_\lambda$ and $\mathrm{AIC}_{c_\lambda}$, and in all
three cases $\mathrm{BIC}_\lambda$ selected a model with no predictors, suggesting that it may be selecting
an underfitted model. If we approach this problem from a predictive point of view, we
know that there is little advantage to using SCAD over the Lasso, but that the choice of
the second tuning parameter can greatly impact the performance of the former. Therefore,
we recommend focusing on the Lasso. From the simulations we know that 10-fold CV
maintains good performance in a variety of settings. However, it is 10 times more expensive
to implement than using an information criterion, the asymptotic properties of 10-fold CV
are not fully understood in this context, and the randomness involved in the procedure makes
it difficult for data analysts to reproduce results. In the case of the Lasso, this last point is
reinforced by the change in the selected variables between the two runs of 10-fold CV, as in
the first run four nonzero coefficients were estimated, while in the second run none were. We
recommend proceeding using $\mathrm{AIC}_{c_\lambda}$ as the selector of the tuning parameter for the Lasso as
an alternative that avoids these issues.



Table 4: Selected variables for transaction data.

Selector            Ave. Comp.  Ave. Comp.  Ave. Comp.  Seller      Negative    Negative
                    Price       Condition   Rating      Condition   Comments    Comments
                                                                    (30 days)   (Lifetime)

SCAD 


10-fold CV (1) 
10-fold CV (2) 
AIC 

GCVx 


X 

X 

X X X X X 
X 

X 

X X X X X 



SCAD (a = 3.7)
  10-fold CV (1)    X
  10-fold CV (2)    X
  AIC_λ             X           X           X           X           X           X
  AICc_λ            X           X           X           X           X           X
  BIC_λ
  Cp_λ              X           X           X           X           X           X
  GCV_λ             X           X           X           X           X           X

LASSO
  10-fold CV (1)    X           X                       X           X
  10-fold CV (2)
  AIC_λ             X           X           X           X           X
  AICc_λ            X
  BIC_λ
  Cp_λ              X
  GCV_λ             X



6 Concluding Remarks 

This paper studied the asymptotic and finite sample performance of classical model selection
procedures in the context of penalized likelihood estimators without the assumption that the
true model is included amongst the candidate models. We proved that $\mathrm{AIC}_\lambda$, $\mathrm{AIC}_{c_\lambda}$, $C_{p_\lambda}$, and
$\mathrm{GCV}_\lambda$ are efficient selectors of the regularization parameter for regularized regression, and
the numerical studies for regularized regression yielded several interesting observations. As
anticipated, we found that $\mathrm{BIC}_\lambda$ is outperformed by the efficient model selection procedures
and demonstrated that $\mathrm{AIC}_\lambda$, $\mathrm{BIC}_\lambda$, $C_{p_\lambda}$, and $\mathrm{GCV}_\lambda$ are all sensitive to the number of
predictor variables that are included in the full model and that their performance can suffer
as a result. In light of this issue we recommend that researchers use a method that is
insensitive to the number of variables included in the model. From the simulations, 10-fold
CV has the best overall performance. However, the discussion in Section 5 noted some of the
disadvantages of this method including computational cost and variable results due to the
inherent randomness of the procedure. As an alternative, data analysts can consider using
$\mathrm{AIC}_{c_\lambda}$, which was shown here to be an efficient selection procedure for the tuning parameter,
and which the simulations suggest has comparable performance to that of 10-fold CV. Lastly,
the simulations suggest that there is no clear advantage to using SCAD in a world where the
"oracle property" does not apply. Combining this with the facts that the Lasso can be fitted
using the efficient 'Lars' algorithm and does not involve a second tuning parameter that can
greatly impact results, researchers may prefer to use the Lasso if they feel that they are in
the non-true model world.

To further generalize our results, we also proved that $\mathrm{AIC}_\lambda$ is an efficient selector of the
regularization parameter for regularized GLMs with no dispersion parameter and used
numerical studies to compare its performance to that of $\mathrm{AIC}_{c_\lambda}$, $\mathrm{BIC}_\lambda$ and 10-fold CV. Again, the
performance of $\mathrm{BIC}_\lambda$ was noticeably worse than the other procedures, and the performances
of $\mathrm{AIC}_\lambda$, $\mathrm{AIC}_{c_\lambda}$ and 10-fold CV were comparable to each other, supporting our
recommendation for the use of $\mathrm{AIC}_{c_\lambda}$. Extending these results to GLMs with an unknown dispersion
parameter is an interesting open problem. In this setting it is necessary to work with
extended quasi-likelihood methods. Although model selection criteria such as AICc have been
proposed in such settings (Hurvich and Tsai, 1995), the extended quasi-likelihood is not a
true likelihood so the results of White (1982) and Nishii (1988) do not apply. Investigation
into the properties of model selection procedures in this setting is an area for future research.

As a final remark, this paper dealt with the case when $d_n/n \to 0$, and the theoretical
results cannot be directly extended to the case when $d_n/n$ converges to something other
than zero. The latter setting has received a great deal of attention in recent literature (in
particular $d_n \gg n$) and is an area for future investigation.

Appendix A



Proof of Lemma 2.1. By definition



Then by (2.1) and (2.2) 

Before proving Theorems 1, 2, and 3, we establish the following two lemmas.

Proof. The technique used to prove this result is similar to the proof of Theorem 2 in Shibata
lAtg;, - MaI

n 
Applying the Cauchy-Schwarz inequality, it follows that



Then 



\ai-a'\<L{(3^^ + 2\\e 



+ 



n 



a 



n 



+ 



n 



nLiP, 



n 



m 



ax/ 



+ 



+ 



2 


da\ 1 1^1 1 


1/2 




a 


n n 







1/2 



ax) 



1/2 



1 



+ 



n 



By definition, 



max) > ""'"^"^ 



n 



Thus 



- <^'^\dax 



< sup 

A6[0,Am, 

+ sup 

Ae[0,A„„ 



sup 

Ae[0,A™t 



ax) 



dax Ikll 

a n n 



L(/3„Ji?(/3 



1/2 



«A^ 



i(/3A) 



(A.l) 



^(/9a) 

^l|2 



1/2 



L{Pax)RiP. 



ax' 



1/2 



+ sup 

Ae[o,A,„, 



ax) J 
dax IIAqa 



AaII^ 



n 



nL{(3^) 



Li (1987) established that 



sup 



Li(3a) 



R(Pa) 



and it follows that 



sup 

aeAn 



m 



ax) 



(A.2) 
In addition, from the proof of Theorem 2 in ZLT we have that



sup 

Ae[0,A 

max\ 



L(/3„J-L(/3, 



(A.3) 



and 



sup — 

Ae[o,A^a^] nL{(3x) 



(A.4) 



Combining these results with the Law of Large Numbers and the assumption that $d_n/n \to 0$
as $n \to \infty$, the four terms on the right-hand side of equation (A.1) converge to 0 in probability.
Hence,



sup 



Ae[o,A™a:.] nL(/3 



X) 



as desired. 



□ 



Lemma A.2. Assume that (A1)-(A4) hold and that $d_n/n \to 0$ as $n \to \infty$. Then



sup 

Ae[o,A„aa;] nL{/3x 



Proof. Start by noting that for all A G [0, Xmax], ^ (ZLT). Consider 



R{l3^)d^, ^ (A^ + ^)do,. 



< 



A, 



+ 



< 2. 



From the proof of Lemma 1 we have that 



n 



n 
dr. 



n 

dr,. 



n 



+ 



n 



\£\ 

dr, 
Thus



< 



n d„ 2 

+ 



n 



nL0^) n-dn-l n a n - dn - 1 



\e\r d„ 



n n 



1/2 



1/2 



[n- dn - 1)0-2 



Under the assumption that $d_n/n \to 0$ as $n \to \infty$ it follows that



dn\d-'^ - cP'^ 



0. 



Combining these results with (A.2) and (A.3) it follows that



rfgj^'-^'l ^ d„|a2 - g^l d^^R{i5-o) HP^) i?(/3^J 

^ _^ sup /V _ ^ ^ /V ^ -^ 

nL(/3A) [o,A_.] nL(/3^) d„i?(/3, J i?(/3^) L(/3, J L(/3 J 



< 2 — ^ i- sup 



0. 



□ 

Proof of Theorem 1. As in the proofs in ZLT, to prove that $C_{p_\lambda}$ is asymptotically loss efficient,
it is sufficient to show that



sup 

ag[o,a„„ 



Pa 



Vn - L{f3y 



(A.5) 



Decomposing $C_{p_\lambda}$ it can be established that



Px 



|y- AaIP ^ ^g^rf^^ 



n 



n 



n 



+ L(/3J + (L(^„J-L0J) 



IIAa, - AaII^ 



n 



n 



n 



n 
The proof of Theorem 2 in ZLT established that



sup 

Ae[0,A 

max 



and, 



sup 

Ae[0,A max 



Combining these results with (A.2)-(A.4) and Lemma 2, (A.5) follows as desired.



□ 



Proof of Theorem 2. The proof is the same as that of Theorem 1 except that the estimated 
variance is based on the candidate model rather than the full model and the result is estab- 
lished by using Lemma 1 in place of Lemma 2. □ 

Proof of Theorem 3. As in the efficiency proof for F^, it is sufficient to show that 



sup 

Ae[0,Am< 



(A.6) 



to establish that Fa is an asymptotically efficient selection procedure for the regularization 
parameter, A. By the definition of F^ we have that 



sup 

Ae[0,A 

max J 



fA-||£||Vn-m 



sup 



< sup 



+ sup 

Ae[0,A 

max[ 



6xCTl + rx-\\s\\yn-L{f3, 



aJ 



A 

rA-||e||V^-^(/9A) 



+ sup — ^ — 



L{(3 



X) 



The last two terms converge to zero by (C1) and the efficiency proof for F^. From the proof
of the previous lemma we further have that



/- ,\ 1/2 / \ 1/2 

s\\ fL{/3^j\ / \5x\ \ 



n 



+ 



n 



l|2 



1/2 



nL{f3, 



By (C1), (C2), and similar arguments as those used in the efficiency proof for Tx we have
that the right hand side converges to 0 in probability. Therefore, it follows that



sup 

Ag[0,A 

max 



and so equation (A.6) holds as desired.



□ 



Appendix B 



Proof of Lemma 3.1. Under assumptions (A5')-(A7'),



sup 



' ' ' ^ ^ max ^ i ^ y Q 



Ae[0,A„ax] RKL0aJ ^KLiPl) TlLKLiPl] 



Lemma B.1. Under (R1)-(R5), for $n$ sufficiently large



Proof. Taylor's expansion of b(Oa) around 0* gives us 



□ 



* \ \2\ 
For $n$ sufficiently large, we have that



LKL0a) = ^/^^(^O -Oa + ^l^K^a) 

n n 
n 



where the last equality follows from equations (3.1) and (3.2). 



□ 



Lemma B.2. Under assumptions (A1')-(A4'), (A7') and regularity conditions (R1)-(R3),
the following results hold.



sup 



11/ -M 



\T(Q* 



0, 



sup 



- tr{(X;W,Xj-iX;WoX J) 



sup 



((y - M)'H„(y - /x) - tr{(X',W,X,)-iX'„WoXj) 



and 



sup 

a&A„ 



nRKL{l3a) 



0, 



^p 0. 



(B.l) 

(B.2) 
(B.3) 

(B.4) 



RKLif3J 

The proof of this lemma requires the following matrix algebra results. 

Definition B.1. Let $A$ and $B$ be two $K \times K$ matrices. We say that $A \ge B$ if $A - B$ is
positive semidefinite.

Lemma B.3. (Horn and Johnson, 1985, p. 471) If $A$ and $B$ are $K \times K$ positive definite
Hermitian matrices, then

(i.) $A \ge B$ if and only if $B^{-1} \ge A^{-1}$;

(ii.) if $A \ge B$, then $\lambda_k(A) \ge \lambda_k(B)$ for all $k = 1, \dots, K$, where $\lambda_k(A)$ and $\lambda_k(B)$ are the
$k$th largest eigenvalues of $A$ and $B$, respectively.

Lemma B.4. (Marshall et al., 2010, p. 340) If $A$ and $B$ are $K \times K$ positive semidefinite
Hermitian matrices, then

$$\mathrm{tr}(AB) \le \sum_{k=1}^{K} \lambda_k(A)\lambda_k(B).$$
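
The matrix facts stated in Lemmas B.3 and B.4 can be checked numerically on a small example. The sketch below is only a sanity check under stated assumptions, not part of the proofs: the matrices are random real symmetric positive definite matrices, the helper names are hypothetical, and small tolerances are used to absorb floating-point error. It also checks the related trace bound involving the largest eigenvalue that is used later in the proof of Lemma B.2.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5


def random_pd(rng, K):
    """Random symmetric positive definite K x K matrix."""
    M = rng.standard_normal((K, K))
    return M @ M.T + 0.1 * np.eye(K)


def eig_desc(M):
    """Eigenvalues of a symmetric matrix, largest first."""
    return np.sort(np.linalg.eigvalsh(M))[::-1]


def is_psd(M, tol=1e-9):
    """True if the symmetric part of M has no eigenvalue below -tol."""
    return bool(np.linalg.eigvalsh((M + M.T) / 2).min() >= -tol)


B = random_pd(rng, K)
A = B + random_pd(rng, K)  # A - B is positive definite, so A >= B

# Lemma B.3 (i): A >= B implies B^{-1} >= A^{-1}.
inv_order_holds = is_psd(np.linalg.inv(B) - np.linalg.inv(A))
# Lemma B.3 (ii): ordered eigenvalues of A dominate those of B.
eig_order_holds = bool(np.all(eig_desc(A) >= eig_desc(B) - 1e-12))
# Lemma B.4: tr(AB) is at most the sum of products of ordered eigenvalues.
trace_bound_holds = bool(np.trace(A @ B) <= eig_desc(A) @ eig_desc(B) + 1e-9)
# Related bound: tr(AB) is at most the largest eigenvalue of A times tr(B).
top_eig_bound_holds = bool(np.trace(A @ B) <= eig_desc(A)[0] * np.trace(B) + 1e-9)

print(inv_order_holds, eig_order_holds, trace_bound_holds, top_eig_bound_holds)
```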



Proof of Lemma B.2. We start by proving equation (B.1). By Chebyshev's Inequality and
Theorem 2 of Whittle (1960), we have that



Pr sup 



Or 



nRKL0a) 



>6] < 



c 



(B.5) 



Now $R_{KL}(\beta^*_\alpha) = L_{KL}(\beta^*_\alpha)$. If we consider $L_{KL}(\cdot)$ as a function of $\theta$, then by a second-order
Taylor series expansion around $\theta_0$,



Lkl{0:) = he:- 0ofW (01-00), 

f 



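where $W = \mathrm{diag}\{b''(\bar\theta_1), \dots, b''(\bar\theta_n)\}$ and $\bar\theta_i$ is on the line segment between $\theta^*_{\alpha i}$ and $\theta_{0i}$. Since
$b''(\theta) > 0$ for all $\theta$, it follows that $nR_{KL}(\beta^*_\alpha) \ge K\|\theta^*_\alpha - \theta_0\|^2$ for some constant $K > 0$.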



Therefore the right-hand side of equation (B.5) is less than or equal to 



5^ 



J2 (nRKLik)) 



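for some constant $C > 0$, which tends to zero as $n \to \infty$ by assumption (A3'). Next, to
establish equation (B.2) we first note that $X_\alpha' W_0 X_\alpha \le \max_{1 \le i \le n} b''(\theta_{0i})\, X_\alpha' X_\alpha$, and
$X_\alpha' W_\alpha X_\alpha \ge \min_{1 \le i \le n} b''(\theta_{\alpha i})\, X_\alpha' X_\alpha$. From Lemmas B.2 and B.3 it follows then that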

tr((X',W„X„)-iX'„WoX„) < '^''''^-'-Zf^ , Ai ( f-X'^X,) | Ai f-X^^X^ ) < d^C 
for some constant $C > 0$. Using this result we have that



sup 



2(d, - tr{(X',W,X„)-iX'„WoX,}) 



2d„(l + C) ^ dnil + C) 

< sup ^ — < 



which tends to zero by assumption (A7'). 



To prove equation (B.3) we apply Chebyshev's Inequality and Theorem 2 of Whittle
(1960) to get that



Pr sup 

\aeAn 



2((y - /i)'H„(y - fi) - tr{(X^W„X,)-iX'„WoX„}) 



nRKLiPa) 



> 6 



for some constant $C > 0$. Using the fact that $\mathrm{tr}\{AB\} \le \lambda_1(A)\,\mathrm{tr}\{B\}$,



tr{(X:,W,X„)-ix:,WoX„(X:,W„X„)-iX'„WoX„} < ii:tr{(X'„W,X„)-iX',WoX,} 



for some constant $K > 0$. Therefore



Pr sup 

\a<^An 



2((y - /^)'H„(y - /x) - tr{(X^W,X„)-iX'„WoX„}) 



nRxLiPa) 



> 6 



< ^-2,^/ tr{(X^,W^X^)-iX^WoX^}'' 



aeA 



for some constant C > 0. Since 



tr{(X'„W„X,)-iX'„WoX,} 



n 



< RKliPa), 



it follows that 



Pr sup 

\a&A„ 



2((y - /i)'H,(y - /x)) - tr{(X',W„X„)-iX',WoX„}) 



uRkUPo 



> 6 



a&An 
Finally, equation (B.4) follows from (B.3). □



Lemma B.5. Under (A1')

$$\|\hat\theta_\lambda - \hat\theta_{\alpha_\lambda}\|^2 \le nC\|b\|^2.$$



Proof. satisfies 



n op 



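Without loss of generality, we can write $\hat\beta_\lambda = (\hat\beta_{\lambda 1}, \hat\beta_{\lambda 2})'$ where $\hat\beta_{\lambda 2} = 0$ and $\hat\beta_{\lambda 1}$ is a $1 \times d_{\alpha_\lambda}$
vector of estimated coefficients. Applying the mean value theorem, we get that

\[ \frac{1}{n} X'_{\alpha_\lambda} \bar{W} X_{\alpha_\lambda} (\hat\beta_{\lambda 1} - \hat\beta_{\alpha_\lambda}) = b_1, \]

where $\bar\beta$ is on the line segment joining $\hat\beta_{\lambda 1}$ and $\hat\beta_{\alpha_\lambda}$, and $b_1$ collects the non-zero components of
$b$ that correspond to $\hat\beta_{\lambda 1}$. For $n$ sufficiently large, it follows then that

\[ \hat\beta_{\lambda 1} - \hat\beta_{\alpha_\lambda} = \left(\frac{1}{n} X'_{\alpha_\lambda} \bar{W} X_{\alpha_\lambda}\right)^{-1} b_1, \tag{B.6} \]

where $\bar{W} = \mathrm{diag}\{b''(\bar\theta_1), \ldots, b''(\bar\theta_n)\}$. Therefore

\[ \|\hat\theta_\lambda - \hat\theta_{\alpha_\lambda}\|^2 = \|X_{\alpha_\lambda}(\hat\beta_{\lambda 1} - \hat\beta_{\alpha_\lambda})\|^2 = n\, b'_1 \left(\frac{1}{n} X'_{\alpha_\lambda} \bar{W} X_{\alpha_\lambda}\right)^{-1} \left(\frac{1}{n} X'_{\alpha_\lambda} X_{\alpha_\lambda}\right) \left(\frac{1}{n} X'_{\alpha_\lambda} \bar{W} X_{\alpha_\lambda}\right)^{-1} b_1. \]

Since

\[ \left(\frac{1}{n} X'_{\alpha_\lambda} \bar{W} X_{\alpha_\lambda}\right)^{-1} \left(\frac{1}{n} X'_{\alpha_\lambda} X_{\alpha_\lambda}\right) \left(\frac{1}{n} X'_{\alpha_\lambda} \bar{W} X_{\alpha_\lambda}\right)^{-1} \le \left(\min_{1 \le i \le n} b''(\bar\theta_i)\right)^{-2} \left(\frac{1}{n} X'_{\alpha_\lambda} X_{\alpha_\lambda}\right)^{-1}, \tag{B.7} \]

it follows that

\[ \|\hat\theta_\lambda - \hat\theta_{\alpha_\lambda}\|^2 \le nC\|b\|^2 \]

by Lemma B.3 and assumption (A1'). □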

Since $A_n$ includes all subsets, the results in Lemma B.2 will still hold when the candidate
model $\alpha$ is replaced by the random candidate model $\alpha_\lambda$.

Lemma B.6. Under (A1')-(A7'),



sup 



Lkl0x) 



Lkl{!3 



(B.8) 



Proof. Applying a second-order Taylor expansion, we get 



where the last equality follows from the fact that $\hat\beta_{\alpha_\lambda}$ is the maximum-likelihood estimator,
so $X'_{\alpha_\lambda}(y - b'(\hat\theta_{\alpha_\lambda})) = 0$.



By equation (B.6) and assumptions (A5') and (A6'), the first term is bounded by 



M^iy-^.r^{^-X:^W^XlJ-'l 

n \ n n ^ ^ 



where $\mathbf{1}$ is a $d_{\alpha_\lambda} \times 1$ vector of ones. Applying Chebyshev's Inequality and Theorem 2 of
Whittle (1960), we have that



for some constant C > 0. By equation (B.7) and assumption (Al'), this does not exceed 



C 



dl 



By (A6'), $d_\alpha / (n R_{KL}(\beta^*_\alpha)) \to 0$, so, for $n$ sufficiently large, $d_\alpha < n R_{KL}(\beta^*_\alpha)$. Therefore, the last
quantity is less than or equal to



6^ 



which tends to zero by assumption (A3'). Thus 



sup 

Ae[0,AmE 



2(y - f^ne, -0aj^. 



Assuming that (A4')-(A7') hold, equation (B.8) follows from this result and Lemma B.5. □



Proof. To prove the efficiency of $\mathrm{AIC}_\lambda$, it suffices to show that



sup 

Ae[o,Am, 



AICx-iy^eo + iin{eo)-LKL{f3, 



Lkl{Px) 



0. 



(B.9) 



Consider 



AIC, - -y^Oo + -1^6(00) = -y^(0o - ^a) + -I^(K^a) - K^o)) + 2^ 
n n n n n 



LKL0x) + -{y-f^fiOo-oi 

n 



ax) 



By the expansion in equation (3.2) we have that 



asymptotically. Therefore 



AICx - -y^Oo + -l^biOo) = Lkl0x) + -{y- m)^(0o - 
n n n 



+ -{d^, - tr{{X'W,^X^,)-'X'W,X^J) + -{y - fj,y{e 



n 



n 



'ax 
Applying Lemmas B.2 and B.6, equation (B.9) holds as desired.



□ 



C Supplementary Material 

This supplemental section contains the technical details required to show that Theorem 3
can be used to prove the efficiency of $\mathrm{AIC}_\lambda$, $\mathrm{GCV}_\lambda$, and $\mathrm{AICc}_\lambda$, the regularity conditions
required for Theorem 4 to hold, and the mathematical results needed to apply Lemma 2.1
to the simulation examples.

C.1 Verifying the Conditions of Theorem 3

The following shows that $\mathrm{AIC}_\lambda$, $\mathrm{GCV}_\lambda$, and $\mathrm{AICc}_\lambda$ can be written in the form $\Gamma_n(\lambda)$ and that
Conditions (C1) and (C2) of Theorem 3 are satisfied. This implies that the three methods
are efficient selectors of the regularization parameter. Shibata (1981) and Hurvich and Tsai
(1989) noted that AIC and AICc, respectively, can be shown to satisfy these conditions. We
present a detailed argument for these remarks below.



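$\mathrm{AIC}_\lambda$ is Efficient

Minimizing $\mathrm{AIC}_\lambda$ is equivalent to minimizing

\[ \hat\sigma^2_\lambda \exp\left(\frac{2 d_{\alpha_\lambda}}{n}\right). \]

Using Taylor's expansion we get

\[ \exp\left(\frac{2 d_{\alpha_\lambda}}{n}\right) = \sum_{k=0}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n}\right)^k \frac{1}{k!} = 1 + \frac{2 d_{\alpha_\lambda}}{n} + \sum_{k=2}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n}\right)^k \frac{1}{k!}, \]

and we see that $\mathrm{AIC}_\lambda$ has the same asymptotic properties as

\[ \Gamma_n(\lambda) = \hat\sigma^2_\lambda \left(1 + \frac{2 d_{\alpha_\lambda}}{n} + \delta_\lambda\right), \]

where

\[ \delta_\lambda = \sum_{k=2}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n}\right)^k \frac{1}{k!}. \]

Therefore, the efficiency of $\mathrm{AIC}_\lambda$ can be established by showing that (C1) and (C2) hold.
Consider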



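\[ \delta_\lambda = \sum_{k=2}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n}\right)^k \frac{1}{k!} = \exp\left(\frac{2 d_{\alpha_\lambda}}{n}\right) - 1 - \frac{2 d_{\alpha_\lambda}}{n}. \]

Therefore, under the assumption that $d_n/n \to 0$, (C1) is satisfied. Next consider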



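\begin{align*}
\frac{\delta_\lambda}{R(\hat\beta_{\alpha_\lambda})} &= \frac{1}{R(\hat\beta_{\alpha_\lambda})} \sum_{k=2}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n}\right)^k \frac{1}{k!} \\
&\le \frac{n}{\sigma^2 d_{\alpha_\lambda}} \sum_{k=2}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n}\right)^k \frac{1}{k!} \\
&= \frac{2}{\sigma^2} \sum_{k=2}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n}\right)^{k-1} \frac{1}{k!} \\
&\le \frac{2}{\sigma^2} \sum_{k=2}^{\infty} \left(\frac{2 d_n}{n}\right)^{k-1} \frac{1}{(k-1)!} \\
&= \frac{2}{\sigma^2} \left(\exp\left(\frac{2 d_n}{n}\right) - 1\right) \to 0.
\end{align*}

Here the inequality on the second line follows from the fact that $R(\hat\beta_{\alpha_\lambda}) \ge \sigma^2 d_{\alpha_\lambda}/n$ and the
final result follows from the assumption that $d_n/n \to 0$. Therefore,

\[ \sup_{\lambda \in [0, \lambda_{\max}]} \frac{\delta_\lambda}{L(\hat\beta_\lambda)} = \sup_{\lambda \in [0, \lambda_{\max}]} \frac{L(\hat\beta_{\alpha_\lambda})}{L(\hat\beta_\lambda)} \cdot \frac{R(\hat\beta_{\alpha_\lambda})}{L(\hat\beta_{\alpha_\lambda})} \cdot \frac{\delta_\lambda}{R(\hat\beta_{\alpha_\lambda})} \overset{p}{\to} 0, \]

so (C2) is satisfied.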
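
The two remainder bounds used for AIC can be checked numerically. The following is a minimal sketch, assuming $\sigma^2 = 1$ and the illustrative growth rate $d_n = \lfloor\sqrt{n}\rfloor$ (both are assumptions made only for this example, as are the function names); it evaluates the remainder $\exp(2d/n) - 1 - 2d/n$, compares it with the truncated series, and compares the ratio at the lower bound $R = \sigma^2 d/n$ with the bound $(2/\sigma^2)(\exp(2d/n) - 1)$.

```python
import math


def delta_aic(d, n):
    """Closed form of the remainder: exp(2d/n) - 1 - 2d/n."""
    x = 2.0 * d / n
    return math.expm1(x) - x


def delta_aic_series(d, n, terms=60):
    """Same remainder as a truncated series: sum_{k>=2} (2d/n)^k / k!."""
    x = 2.0 * d / n
    return sum(x ** k / math.factorial(k) for k in range(2, terms))


def c2_ratio_at_lower_bound(d, n, sigma2=1.0):
    """delta / R evaluated at the smallest risk allowed, R = sigma2 * d / n."""
    return delta_aic(d, n) / (sigma2 * d / n)


def c2_bound(d, n, sigma2=1.0):
    """Upper bound (2 / sigma2) * (exp(2d/n) - 1) from the displayed chain."""
    return 2.0 / sigma2 * math.expm1(2.0 * d / n)


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        d = int(math.sqrt(n))  # hypothetical rate with d/n -> 0
        print(n, d, delta_aic(d, n), c2_ratio_at_lower_bound(d, n), c2_bound(d, n))
```

Both printed quantities shrink as $n$ grows along this sequence, which is the behaviour that (C1) and (C2) require.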



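$\mathrm{GCV}_\lambda$ is Efficient

Using Taylor's expansion we get

\[ \frac{1}{(1 - d_{\alpha_\lambda}/n)^2} = \sum_{k=1}^{\infty} k \left(\frac{d_{\alpha_\lambda}}{n}\right)^{k-1} = 1 + \frac{2 d_{\alpha_\lambda}}{n} + \sum_{k=3}^{\infty} k \left(\frac{d_{\alpha_\lambda}}{n}\right)^{k-1}, \]

and we see that $\mathrm{GCV}_\lambda$ has the same asymptotic properties as

\[ \Gamma_n(\lambda) = \hat\sigma^2_\lambda \left(1 + \frac{2 d_{\alpha_\lambda}}{n} + \delta_\lambda\right), \]

where

\[ \delta_\lambda = \sum_{k=3}^{\infty} k \left(\frac{d_{\alpha_\lambda}}{n}\right)^{k-1}. \]

Therefore, the efficiency of $\mathrm{GCV}_\lambda$ can be established by showing that (C1) and (C2) hold.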
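Consider

\[ \delta_\lambda = \sum_{k=3}^{\infty} k \left(\frac{d_{\alpha_\lambda}}{n}\right)^{k-1} = \frac{1}{(1 - d_{\alpha_\lambda}/n)^2} - 1 - \frac{2 d_{\alpha_\lambda}}{n}. \]

Therefore, under the assumption that $d_n/n \to 0$, (C1) is satisfied. Next consider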



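\begin{align*}
\frac{\delta_\lambda}{R(\hat\beta_{\alpha_\lambda})} &= \frac{1}{R(\hat\beta_{\alpha_\lambda})} \sum_{k=3}^{\infty} k \left(\frac{d_{\alpha_\lambda}}{n}\right)^{k-1} \\
&\le \frac{n}{\sigma^2 d_{\alpha_\lambda}} \sum_{k=3}^{\infty} k \left(\frac{d_{\alpha_\lambda}}{n}\right)^{k-1} \\
&= \frac{1}{\sigma^2} \sum_{k=3}^{\infty} k \left(\frac{d_{\alpha_\lambda}}{n}\right)^{k-2} \\
&= \frac{1}{\sigma^2} \left(\sum_{k=3}^{\infty} (k-1) \left(\frac{d_{\alpha_\lambda}}{n}\right)^{k-2} + \sum_{k=3}^{\infty} \left(\frac{d_{\alpha_\lambda}}{n}\right)^{k-2}\right) \\
&= \frac{1}{\sigma^2} \left(\sum_{k=2}^{\infty} k \left(\frac{d_{\alpha_\lambda}}{n}\right)^{k-1} + \sum_{k=0}^{\infty} \left(\frac{d_{\alpha_\lambda}}{n}\right)^{k} - 1\right) \\
&= \frac{1}{\sigma^2} \left(\frac{1}{(1 - d_{\alpha_\lambda}/n)^2} + \frac{1}{1 - d_{\alpha_\lambda}/n} - 2\right),
\end{align*}

which converges to zero uniformly over $\lambda$ under the assumption that $d_n/n \to 0$. Here, again,
the inequality on the second line follows from the fact that $R(\hat\beta_{\alpha_\lambda}) \ge \sigma^2 d_{\alpha_\lambda}/n$. Therefore,

\[ \sup_{\lambda \in [0, \lambda_{\max}]} \frac{\delta_\lambda}{L(\hat\beta_\lambda)} = \sup_{\lambda \in [0, \lambda_{\max}]} \frac{L(\hat\beta_{\alpha_\lambda})}{L(\hat\beta_\lambda)} \cdot \frac{R(\hat\beta_{\alpha_\lambda})}{L(\hat\beta_{\alpha_\lambda})} \cdot \frac{\delta_\lambda}{R(\hat\beta_{\alpha_\lambda})} \overset{p}{\to} 0, \]

so (C2) is satisfied.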
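
The series manipulations for GCV can be verified numerically in the same spirit. The sketch below assumes $\sigma^2 = 1$ and a few illustrative pairs $(d, n)$ chosen only for this example; it checks that $\sum_{k \ge 3} k (d/n)^{k-1}$ matches $(1 - d/n)^{-2} - 1 - 2d/n$, and that dividing by $\sigma^2 d/n$ gives $(1/\sigma^2)\{(1 - d/n)^{-2} + (1 - d/n)^{-1} - 2\}$. The function names are hypothetical.

```python
def delta_gcv(d, n):
    """Closed form of the GCV remainder: (1 - d/n)^(-2) - 1 - 2d/n."""
    x = d / n
    return 1.0 / (1.0 - x) ** 2 - 1.0 - 2.0 * x


def delta_gcv_series(d, n, terms=200):
    """Same remainder as a truncated series: sum_{k>=3} k (d/n)^(k-1)."""
    x = d / n
    return sum(k * x ** (k - 1) for k in range(3, terms))


def c2_bound_gcv(d, n, sigma2=1.0):
    """(1/sigma2) * ((1 - d/n)^(-2) + (1 - d/n)^(-1) - 2)."""
    x = d / n
    return (1.0 / (1.0 - x) ** 2 + 1.0 / (1.0 - x) - 2.0) / sigma2


if __name__ == "__main__":
    for d, n in ((10, 100), (100, 10_000), (1000, 1_000_000)):
        ratio = delta_gcv(d, n) / (d / n)  # delta / R at R = sigma2 * d / n, sigma2 = 1
        print(d, n, delta_gcv(d, n), ratio, c2_bound_gcv(d, n))
```

At $R = \sigma^2 d/n$ the ratio and the bound coincide, because every step after the inequality in the chain is an equality.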
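$\mathrm{AICc}_\lambda$ is Efficient

We define

\[ \mathrm{AICc}_\lambda = \log(\hat\sigma^2_\lambda) + \frac{2(d_{\alpha_\lambda} + 1)}{n - d_{\alpha_\lambda} - 2}. \]

This can be equivalently defined as

\[ \mathrm{AICc}_\lambda = \log(\hat\sigma^2_\lambda) + \frac{2 d_{\alpha_\lambda}}{n} + \frac{2}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)}. \]

Based on the second definition of $\mathrm{AICc}_\lambda$ we see that the information criterion has the same
asymptotic properties as

\[ \log(\hat\sigma^2_\lambda) + \frac{2 d_{\alpha_\lambda}}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)} \]

because they only differ by an additive constant ($2/n$). Therefore, $\mathrm{AICc}_\lambda$ will have the same
asymptotic behavior as

\[ \hat\sigma^2_\lambda \exp\left(\frac{2 d_{\alpha_\lambda}}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)}\right). \]

Using Taylor's expansion we get

\begin{align*}
\exp\left(\frac{2 d_{\alpha_\lambda}}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)}\right) &= \sum_{k=0}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)}\right)^k \frac{1}{k!} \\
&= 1 + \frac{2 d_{\alpha_\lambda}}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)} \\
&\quad + \sum_{k=2}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)}\right)^k \frac{1}{k!},
\end{align*}

and we see that $\mathrm{AICc}_\lambda$ has the same asymptotic properties as

\[ \Gamma_n(\lambda) = \hat\sigma^2_\lambda \left(1 + \frac{2 d_{\alpha_\lambda}}{n} + \delta_\lambda\right), \]

where

\[ \delta_\lambda = \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)} + \sum_{k=2}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)}\right)^k \frac{1}{k!}. \]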
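Therefore, the efficiency of $\mathrm{AICc}_\lambda$ can be established by showing that (C1) and (C2) hold.
Consider

\begin{align*}
\delta_\lambda &= \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)} + \sum_{k=2}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)}\right)^k \frac{1}{k!} \\
&= \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)} + \exp\left(\frac{2 d_{\alpha_\lambda}}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)}\right) \\
&\quad - 1 - \frac{2 d_{\alpha_\lambda}}{n} - \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)},
\end{align*}

which converges to zero uniformly over $\lambda$ under the assumption that $d_n/n \to 0$. Thus, (C1)
is satisfied. Next consider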



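\begin{align*}
\frac{\delta_\lambda}{R(\hat\beta_{\alpha_\lambda})} &= \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{R(\hat\beta_{\alpha_\lambda})\, n(n - d_{\alpha_\lambda} - 2)} \\
&\quad + \sum_{k=2}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)}\right)^k \frac{1}{R(\hat\beta_{\alpha_\lambda})\, k!} \\
&\le \frac{2(1 + 1/d_{\alpha_\lambda})(d_{\alpha_\lambda} + 2)}{\sigma^2 (n - d_{\alpha_\lambda} - 2)} + \frac{n}{\sigma^2 d_{\alpha_\lambda}} \sum_{k=2}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)}\right)^k \frac{1}{k!} \\
&= \frac{2(1 + 1/d_{\alpha_\lambda})(d_{\alpha_\lambda} + 2)}{\sigma^2 (n - d_{\alpha_\lambda} - 2)} + \frac{2}{\sigma^2} \left(1 + \frac{(1 + 1/d_{\alpha_\lambda})(d_{\alpha_\lambda} + 2)}{n - d_{\alpha_\lambda} - 2}\right) \sum_{k=2}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)}\right)^{k-1} \frac{1}{k!} \\
&\le \frac{2(1 + 1/d_{\alpha_\lambda})(d_{\alpha_\lambda} + 2)}{\sigma^2 (n - d_{\alpha_\lambda} - 2)} + \frac{2}{\sigma^2} \left(1 + \frac{(1 + 1/d_{\alpha_\lambda})(d_{\alpha_\lambda} + 2)}{n - d_{\alpha_\lambda} - 2}\right) \sum_{k=2}^{\infty} \left(\frac{2 d_{\alpha_\lambda}}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)}\right)^{k-1} \frac{1}{(k-1)!} \\
&= \frac{2(1 + 1/d_{\alpha_\lambda})(d_{\alpha_\lambda} + 2)}{\sigma^2 (n - d_{\alpha_\lambda} - 2)} + \frac{2}{\sigma^2} \left(1 + \frac{(1 + 1/d_{\alpha_\lambda})(d_{\alpha_\lambda} + 2)}{n - d_{\alpha_\lambda} - 2}\right) \left(\exp\left(\frac{2 d_{\alpha_\lambda}}{n} + \frac{2(d_{\alpha_\lambda} + 1)(d_{\alpha_\lambda} + 2)}{n(n - d_{\alpha_\lambda} - 2)}\right) - 1\right),
\end{align*}

which converges to zero uniformly over $\lambda$ under the assumption that $d_n/n \to 0$. Again, the
inequality on the third line follows from the fact that $R(\hat\beta_{\alpha_\lambda}) \ge \sigma^2 d_{\alpha_\lambda}/n$. Therefore,

\[ \sup_{\lambda \in [0, \lambda_{\max}]} \frac{\delta_\lambda}{L(\hat\beta_\lambda)} = \sup_{\lambda \in [0, \lambda_{\max}]} \frac{L(\hat\beta_{\alpha_\lambda})}{L(\hat\beta_\lambda)} \cdot \frac{R(\hat\beta_{\alpha_\lambda})}{L(\hat\beta_{\alpha_\lambda})} \cdot \frac{\delta_\lambda}{R(\hat\beta_{\alpha_\lambda})} \overset{p}{\to} 0, \]

so (C2) is satisfied.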
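
The algebra linking the two forms of the AICc penalty is easy to confirm numerically. The sketch below assumes the penalty $2(d+1)/(n-d-2)$ and its expanded form $2d/n + 2/n + 2(d+1)(d+2)/\{n(n-d-2)\}$, and uses illustrative pairs $(d, n)$ chosen only for this example; the function names are hypothetical.

```python
import math


def aicc_penalty(d, n):
    """Penalty in the first form: 2(d + 1) / (n - d - 2)."""
    return 2.0 * (d + 1) / (n - d - 2)


def aicc_penalty_expanded(d, n):
    """Penalty in the expanded form: 2d/n + 2/n + 2(d+1)(d+2) / (n(n-d-2))."""
    return 2.0 * d / n + 2.0 / n + 2.0 * (d + 1) * (d + 2) / (n * (n - d - 2))


def delta_aicc(d, n):
    """Remainder exp(2d/n + c) - 1 - 2d/n with c = 2(d+1)(d+2) / (n(n-d-2))."""
    c = 2.0 * (d + 1) * (d + 2) / (n * (n - d - 2))
    return math.expm1(2.0 * d / n + c) - 2.0 * d / n


if __name__ == "__main__":
    for d, n in ((10, 100), (100, 10_000), (1000, 1_000_000)):
        print(d, n, aicc_penalty(d, n), aicc_penalty_expanded(d, n), delta_aicc(d, n))
```

The two penalty columns agree to floating-point precision, and the remainder shrinks along sequences with $d/n \to 0$.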



C.2 Regularity Conditions 

Below are the regularity conditions required to derive the properties of the maximum-likelihood
estimator for misspecified models. Refer to Lv and Liu (2010) for a discussion
of these conditions in the context of generalized linear models with no dispersion parameter.

(R1) $f_\alpha(y; \beta)$ is continuous in $\beta$ for every $\beta$ in $Q$, a compact subset of $\mathbb{R}^{d_\alpha}$.

(R2) (a.) $E_0(\log g(y))$ exists and $|\log f_\alpha(y; \beta)|$ is dominated by an integrable function with
respect to $g$ that is independent of $\beta$. (b.) The KL loss function has a unique minimum
at $\beta^*_\alpha$, which is an interior point of $Q$.

(R3) (a.) $\partial \log f_\alpha(y; \beta)/\partial\beta_i$ and $\partial^2 \log f_\alpha(y; \beta)/\partial\beta_i \partial\beta_j$, $i, j = 1, \ldots, d_\alpha$, are measurable
functions of $y$ for each $\beta \in Q$ and continuously differentiable functions of $\beta$ for each $y$. (b.)
$|\partial \log f_\alpha(y; \beta)/\partial\beta_i|$, $|\partial^2 \log f_\alpha(y; \beta)/\partial\beta_i \partial\beta_j|$, and $|(\partial \log f_\alpha(y; \beta)/\partial\beta_i)(\partial \log f_\alpha(y; \beta)/\partial\beta_j)|$
are dominated by integrable functions with respect to $g$, which are independent of $\beta$.



(R4) The matrices 



and 



3(9*) = Eo 



d\ogfaiy;(3) d log fa{y;f3) 
0(3 80" 

dHog Uy- 13) 



8(380^ 



are positive definite. 
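(R5) (a.) $\partial^3 \log f_\alpha(y; \beta)/\partial\beta_i \partial\beta_j \partial\beta_k$ are measurable with respect to $y$ for $i, j, k = 1, \ldots, d_\alpha$.
(b.) $|\partial \log f_\alpha(y; \beta)/\partial\beta_i|^2$, $|\partial^2 \log f_\alpha(y; \beta)/\partial\beta_i \partial\beta_j|^2$, and $|\partial^3 \log f_\alpha(y; \beta)/\partial\beta_i \partial\beta_j \partial\beta_k|^2$,
$i, j, k = 1, \ldots, d_\alpha$, are dominated by integrable functions with respect to $g$ that are
independent of $\beta$.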

(R6) For some 5 > 0, E\\'Bn^^'^ An0a ~ ^*a)\\^'^^ — where A„ is defined as in equation 

(3) of the manuscript and B„ = XjWoXc^. 

C.3 Verifying the Conditions of Lemma 2.1 
C.3.1 Omitted Predictor with Deterministic X 

We first consider a more general example. Let the true model be defined as

\[ y = \mu + \varepsilon, \]

where $y$ is the $n \times 1$ response vector, $\mu$ is the $n \times 1$ unknown mean vector, and $\varepsilon$ is an $n \times 1$
noise vector where $E(\varepsilon_i) = 0$ and $\mathrm{var}(\varepsilon_i) = \sigma^2$. In what follows we assume that

\[ \mu = X\beta + x_{\mathrm{excl}} \beta_{\mathrm{excl}}, \]

where $X$ is an $n \times d_n$ deterministic matrix of predictors, $\beta$ is a $d_n \times 1$ vector of coefficients, $x_{\mathrm{excl}}$
is an $n \times 1$ deterministic vector, and $\beta_{\mathrm{excl}}$ is a constant. In the following, we take the candidate
models to be the least squares regressions based on all $2^{d_n}$ subsets of $X$; the predictor $x_{\mathrm{excl}}$ is
excluded from consideration so that the true model is never included in the set of candidate
models.

Assume that the following conditions hold:
(C3) $\beta$ contains a fixed number of non-zero entries
(C4) $x_{\mathrm{excl}}$ is orthogonal to the columns of $X$
(C5) $\inf_n x^T_{\mathrm{excl}} x_{\mathrm{excl}} / n > 0$

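By construction, for any candidate model $\alpha$,

\begin{align*}
n R(\hat\beta_\alpha) &\ge \|\mu - H_\alpha \mu\|^2 \\
&= \|(I - H_\alpha) X\beta\|^2 + 2 \beta^T X^T (I - H_\alpha) x_{\mathrm{excl}} \beta_{\mathrm{excl}} + \|x_{\mathrm{excl}} \beta_{\mathrm{excl}}\|^2 \\
&= \|(I - H_\alpha) X\beta\|^2 + \|x_{\mathrm{excl}} \beta_{\mathrm{excl}}\|^2 \\
&\ge \|x_{\mathrm{excl}} \beta_{\mathrm{excl}}\|^2 \\
&= n \beta^2_{\mathrm{excl}} \left(\frac{x^T_{\mathrm{excl}} x_{\mathrm{excl}}}{n}\right) \\
&\ge k_1 n
\end{align*}

for some constant $k_1 > 0$.

For the simulation example in Section 4.1 of the paper, the true vector of coefficients is
fixed and trigonometric predictors are used so conditions (C3)-(C5) are satisfied. Therefore,
for that example it follows that $\|\mu - H_\alpha \mu\|^2 \ge k_1 n$ for some constant $k_1 > 0$.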
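
The lower bound above can be illustrated with a small numerical check. The following is a minimal sketch, not the simulation design of Section 4.1: the sample size, frequencies, and coefficient values are hypothetical choices made only for this example. It uses trigonometric columns at Fourier frequencies, which are mutually orthogonal, so the projection onto any subset of columns reduces to a sum of one-dimensional projections.

```python
import math
from itertools import combinations


def trig_column(kind, j, n):
    """Column cos(2*pi*j*t/n) or sin(2*pi*j*t/n) for t = 0, ..., n-1."""
    fn = math.cos if kind == "cos" else math.sin
    return [fn(2.0 * math.pi * j * t / n) for t in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual_sq_norm(mu, columns):
    """||mu - H mu||^2 when the given columns are mutually orthogonal."""
    resid = list(mu)
    for col in columns:
        coef = dot(mu, col) / dot(col, col)
        resid = [r - coef * c for r, c in zip(resid, col)]
    return dot(resid, resid)


n = 64  # hypothetical sample size
X = [trig_column("cos", 1, n), trig_column("sin", 1, n),
     trig_column("cos", 2, n), trig_column("sin", 2, n)]
beta = [1.0, -0.5, 0.0, 2.0]        # hypothetical coefficients
x_excl = trig_column("cos", 5, n)   # omitted predictor, orthogonal to X
beta_excl = 0.7                     # hypothetical coefficient

mu = [sum(b * col[t] for b, col in zip(beta, X)) + beta_excl * x_excl[t]
      for t in range(n)]

# Lower bound ||x_excl * beta_excl||^2 and the smallest residual over all subsets of X.
floor = beta_excl ** 2 * dot(x_excl, x_excl)
worst = min(residual_sq_norm(mu, subset)
            for r in range(len(X) + 1)
            for subset in combinations(X, r))
print(floor, worst)
```

Every candidate subset leaves a residual of at least `floor`, and the bound is attained by subsets that contain all predictors with non-zero coefficients.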

C.3.2 Exponential Model 



From Fourier analysis (cf. Bloomfield (2000)), if $n$ is even then

\[ \mu_t = e^{4t/n} = A(0) + \sum_{0 < j < n/2} A(f_j) \cos(2\pi f_j t) + \sum_{0 < j < n/2} B(f_j) \sin(2\pi f_j t) + A(f_{n/2}) \cos(2\pi f_{n/2} t), \tag{C.1} \]

where $f_j = j/n$,

\[ A(f_j) = \frac{2}{n} \sum_{t=0}^{n-1} \mu_t \cos(2\pi f_j t) \]

and

\[ B(f_j) = \frac{2}{n} \sum_{t=0}^{n-1} \mu_t \sin(2\pi f_j t). \]

If $n$ is odd then the rightmost term in (C.1) is excluded. To determine $A(f_j)$ and $B(f_j)$ we
will use the fact that

\[ d(f_j) = \frac{A(f_j)}{2} - i \frac{B(f_j)}{2}, \]

where

\[ d(f_j) = \frac{1}{n} \sum_{t=0}^{n-1} \mu_t e^{-2\pi i f_j t}. \]



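For this example

\[ d(f_j) = \frac{1}{n} \sum_{t=0}^{n-1} e^{4t/n} e^{-2\pi i f_j t} = \frac{1}{n} \cdot \frac{1 - e^{4 - 2\pi i j}}{1 - e^{4/n - 2\pi i f_j}} = \frac{1 - e^4}{n} \cdot \frac{1}{(1 - e^{4/n} \cos(2\pi f_j)) + i e^{4/n} \sin(2\pi f_j)}. \]

For any real constants $a$ and $b$, $\frac{1}{a + ib} = \frac{a - ib}{a^2 + b^2}$. It follows then that

\begin{align*}
d(f_j) &= \frac{1 - e^4}{n} \cdot \frac{1 - e^{4/n} \cos(2\pi f_j) - i e^{4/n} \sin(2\pi f_j)}{(1 - e^{4/n} \cos(2\pi f_j))^2 + (e^{4/n} \sin(2\pi f_j))^2} \\
&= \frac{1 - e^4}{n} \left(\frac{1 - e^{4/n} \cos(2\pi f_j)}{1 + (e^{4/n})^2 - 2 e^{4/n} \cos(2\pi f_j)} - i \frac{e^{4/n} \sin(2\pi f_j)}{1 + (e^{4/n})^2 - 2 e^{4/n} \cos(2\pi f_j)}\right).
\end{align*}

Therefore

\[ A(f_j) = 2\, \frac{1 - e^4}{n} \cdot \frac{e^{-4/n} - \cos(2\pi f_j)}{e^{-4/n} + e^{4/n} - 2\cos(2\pi f_j)} \]

and

\[ B(f_j) = 2\, \frac{1 - e^4}{n} \cdot \frac{\sin(2\pi f_j)}{e^{-4/n} + e^{4/n} - 2\cos(2\pi f_j)}. \]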
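
The closed forms for $A(f_j)$ and $B(f_j)$ can be compared against a direct evaluation of the discrete Fourier sum. The sketch below assumes $\mu_t = e^{4t/n}$ for $t = 0, \ldots, n-1$ and the relation $d(f_j) = A(f_j)/2 - iB(f_j)/2$; the function names and the value of $n$ are hypothetical choices for illustration.

```python
import cmath
import math


def dft_coefficient(k, n):
    """d(f_k) = (1/n) * sum_{t=0}^{n-1} exp(4t/n) * exp(-2*pi*i*f_k*t), with f_k = k/n."""
    total = sum(math.exp(4.0 * t / n) * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
    return total / n


def closed_form_A(k, n):
    """2 (1 - e^4)/n * (e^{-4/n} - cos(2 pi f_k)) / (e^{-4/n} + e^{4/n} - 2 cos(2 pi f_k))."""
    c = math.cos(2.0 * math.pi * k / n)
    denom = math.exp(-4.0 / n) + math.exp(4.0 / n) - 2.0 * c
    return 2.0 * (1.0 - math.exp(4.0)) / n * (math.exp(-4.0 / n) - c) / denom


def closed_form_B(k, n):
    """2 (1 - e^4)/n * sin(2 pi f_k) / (e^{-4/n} + e^{4/n} - 2 cos(2 pi f_k))."""
    c = math.cos(2.0 * math.pi * k / n)
    s = math.sin(2.0 * math.pi * k / n)
    denom = math.exp(-4.0 / n) + math.exp(4.0 / n) - 2.0 * c
    return 2.0 * (1.0 - math.exp(4.0)) / n * s / denom


if __name__ == "__main__":
    n = 50  # hypothetical sample size
    for k in (1, 5, 20):
        d = dft_coefficient(k, n)
        # A = 2 Re d and B = -2 Im d under d = A/2 - i B/2.
        print(k, 2.0 * d.real, closed_form_A(k, n), -2.0 * d.imag, closed_form_B(k, n))
```

For each frequency the direct and closed-form values agree up to rounding error.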

For a given (i„, define the n x (n — dn) matrix Xoxci = (^i^cxci' ^Lci) with components 



a^excit, = sin (27rt/j) 



and 



a^Li,, = cos {2-ntfj) 



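for $j = d_n/2 + 1, \ldots, n$. Based on this notation, the $n \times 1$ mean vector can be written as

\[ \mu = X\beta + X_{\mathrm{excl}} \beta_{\mathrm{excl}}, \]

where

\[ \beta = [A(0) \;\; A(f_1) \;\cdots\; A(f_{d_n/2}) \;\; B(f_1) \;\cdots\; B(f_{d_n/2})]^T \]

and

\[ \beta_{\mathrm{excl}} = [A(f_{d_n/2+1}) \;\cdots\; A(f_{n/2}) \;\; B(f_{d_n/2+1}) \;\cdots\; B(f_{n/2-1})]^T. \]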

For this example, consider 

nR0a) > \\n- H^nW^ 

^ I I^e3;d/3ea;dl P 



— 2 exclf^ exd 



2 1 E Mfjr+B{f,r]+'^A{U/.f 

/2<j<n/2 



_n{2{l-e')f [ sin(27r/,„/2+i) 



e-4/n + e4/n_2cOs(27r/rf„/2+l) 



. ci / sin(27r/rf„/2+i) 



n2 V2(cos/i(4/n) - 1) + 2(1 - cos(27r/,„/2+i) 

for some positive constant $c_1$. To simplify notation, define

, ^ ( sin(27r/rf„/2+i) 

" n2 \2{cosh{A/n) - 2) + 2(1 - cos(27r/d„/2+i) 

If (i„ — > oo, then lim^^oo hn/d^ < oo. It follows that 
for some constant ki > 0. 